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Introduction 


This appendix focuses on the GPU—the ubiquitous graphics processing unit 
in every PC, laptop, desktop computer, and workstation. In its most basic form, 
the GPU generates 2D and 3D graphics, images, and video that enable window- 
based operating systems, graphical user interfaces, video games, visual imaging 
applications, and video. The modern GPU that we describe here is a highly 
parallel, highly multithreaded multiprocessor optimized for visual computing. 
To provide real-time visual interaction with computed objects via graphics, 
images, and video, the GPU has a unified graphics and computing architecture 
that serves as both a programmable graphics processor and a scalable parallel 
computing platform. PCs and game consoles combine a GPU with a CPU to form 
heterogeneous systems. 


A Brief History of GPU Evolution 


Fifteen years ago, there was no such thing as a GPU. Graphics on a PC were 
performed by a video graphics array (VGA) controller. A VGA controller was 
simply a memory controller and display generator connected to some DRAM. In 
the 1990s, semiconductor technology advanced sufficiently that more functions 
could be added to the VGA controller. By 1997, VGA controllers were beginning 
to incorporate some three-dimensional (3D) acceleration functions, including 


graphics processing 
unit (GPU) A processor 
optimized for 2D and 3D 
graphics, video, visual 
computing, and display. 


visual computing A mix 
of graphics processing 
and computing that lets 
you visually interact with 
computed objects via 
graphics, images, and 
video. 


heterogeneous system 
A system combining 
different processor types. 
A PC is a heterogeneous 
CPU-GPU system. 
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application programming 


interface (API) A set of 
function and data structure 
definitions providing an 
interface to a library of 
functions. 


hardware for triangle setup and rasterization (dicing triangles into individual 
pixels) and texture mapping and shading (applying “decals” or patterns to pixels 
and blending colors). 

In 2000, the single chip graphics processor incorporated almost every detail 
of the traditional high-end workstation graphics pipeline and therefore, deserved 
a new name beyond VGA controller. The term GPU was coined to denote that 
the graphics device had become a processor. 

Over time, GPUs became more programmable, as programmable processors 
replaced fixed function dedicated logic while maintaining the basic 3D graphics 
pipeline organization. In addition, computations became more precise over time, 
progressing from indexed arithmetic, to integer and fixed point, to single precision 
floating-point, and recently to double precision floating-point. GPUs have become 
massively parallel programmable processors with hundreds of cores and thousands 
of threads. 

Recently, processor instructions and memory hardware were added to support 
general purpose programming languages, and a programming environment was 
created to allow GPUs to be programmed using familiar languages, including C 
and C++. This innovation makes a GPU a fully general-purpose, programmable, 
manycore processor, albeit still with some special benefits and limitations. 


GPU Graphics Trends 


GPUs and their associated drivers implement the OpenGL and DirectX models of 
graphics processing. OpenGL is an open standard for 3D graphics programming 
available for most computers. DirectX is a series of Microsoft multimedia pro- 
gramming interfaces, including Direct3D for 3D graphics. Since these application 
programming interfaces (APIs) have well-defined behavior, it is possible to build 
effective hardware acceleration of the graphics processing functions defined by the 
APIs. This is one of the reasons (in addition to increasing device density) that new 
GPUs are being developed every 12 to 18 months that double the performance of 
the previous generation on existing applications. 

Frequent doubling of GPU performance enables new applications that were 
not previously possible. The intersection of graphics processing and parallel 
computing invites a new paradigm for graphics, known as visual computing. It 
replaces large sections of the traditional sequential hardware graphics pipeline 
model with programmable elements for geometry, vertex, and pixel programs. 
Visual computing in a modern GPU combines graphics processing and parallel 
computing in novel ways that permit new graphics algorithms to be implemented, 
and open the door to entirely new parallel processing applications on pervasive 
high-performance GPUs. 


Heterogeneous System 


Although the GPU is arguably the most parallel and most powerful processor in 
a typical PC, it is certainly not the only processor. The CPU, now multicore and 
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soon to be manycore, is a complementary, primarily serial processor companion 
to the massively parallel manycore GPU. Together, these two types of processors 
comprise a heterogeneous multiprocessor system. 

The best performance for many applications comes from using both the CPU 
and the GPU. This appendix will help you understand how and when to best split 
the work between these two increasingly parallel processors. 


GPU Evolves into Scalable Parallel Processor 


GPUs have evolved functionally from hardwired, limited capability VGA controllers 
to programmable parallel processors. This evolution has proceeded by changing 
the logical (API-based) graphics pipeline to incorporate programmable elements 
and also by making the underlying hardware pipeline stages less specialized and 
more programmable. Eventually, it made sense to merge disparate programmable 
pipeline elements into one unified array of many programmable processors. 

In the GeForce 8-series generation of GPUs, the geometry, vertex, and pixel 
processing all run on the same type of processor. This unification allows for 
dramatic scalability. More programmable processor cores increase the total system 
throughput. Unifying the processors also delivers very effective load balancing, 
since any processing function can use the whole processor array. At the other end 
of the spectrum, a processor array can now be built with very few processors, since 
all of the functions can be run on the same processors. 


Why CUDA and GPU Computing? 


This uniform and scalable array of processors invites a new model of programming 
for the GPU. The large amount of floating-point processing power in the GPU 
processor array is very attractive for solving nongraphics problems. Given the large 
degree of parallelism and the range of scalability of the processor array for graphics 
applications, the programming model for more general computing must express 
the massive parallelism directly, but allow for scalable execution. 

GPU computing is the term coined for using the GPU for computing via a 
parallel programming language and API, without using the traditional graphics 
API and graphics pipeline model. This is in contrast to the earlier General Purpose 
computation on GPU (GPGPU) approach, which involves programming the GPU 
using a graphics API and graphics pipeline to perform nongraphics tasks. 

Compute Unified Device Architecture (CUDA) is a scalable parallel program- 
ming model and software platform for the GPU and other parallel processors that 
allows the programmer to bypass the graphics API and graphics interfaces of the 
GPU and simply program in C or C++. The CUDA programming model has an 
SPMD (single-program multiple data) software style, in which a programmer 
writes a program for one thread that is instanced and executed by many threads 
in parallel on the multiple processors of the GPU. In fact, CUDA also provides a 
facility for programming multiple CPU cores as well, so CUDA is an environment 
for writing parallel programs for the entire heterogeneous computer system. 


GPU computing Using 
a GPU for computing via 
a parallel programming 
language and API. 


GPGPU Using a GPU 
for general-purpose 
computation via a 
traditional graphics API 
and graphics pipeline. 


CUDA A scalable parallel 
programming model 

and language based on 
C/C++. It is a parallel 
programming platform 
for GPUs and multicore 
CPUs. 
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GPU Unifies Graphics and Computing 


With the addition of CUDA and GPU computing to the capabilities of the GPU, 
it is now possible to use the GPU as both a graphics processor and a computing 
processor at the same time, and to combine these uses in visual computing 
applications. The underlying processor architecture of the GPU is exposed in two 
ways: first, as implementing the programmable graphics APIs, and second, as a 
massively parallel processor array programmable in C/C++ with CUDA. 

Although the underlying processors of the GPU are unified, it is not necessary 
that all of the SPMD thread programs are the same. The GPU can run graphics 
shader programs for the graphics aspect of the GPU, processing geometry, vertices, 
and pixels, and also run thread programs in CUDA. 

The GPU is truly a versatile multiprocessor architecture, supporting a variety of 
processing tasks. GPUs are excellent at graphics and visual computing as they were 
specifically designed for these applications. GPUs are also excellent at many general- 
purpose throughput applications that are “first cousins” of graphics, in that they 
perform a lot of parallel work, as well as having a lot of regular problem structure. 
In general, they are a good match to data-parallel problems (see Chapter 7), 
particularly large problems, but less so for less regular, smaller problems. 


GPU Visual Computing Applications 


Visual computing includes the traditional types of graphics applications plus many 
new applications. The original purview of a GPU was “anything with pixels,” but it 
now includes many problems without pixels but with regular computation and/or 
data structure. GPUs are effective at 2D and 3D graphics, since that is the purpose 
for which they are designed. Failure to deliver this application performance would 
be fatal. 2D and 3D graphics use the GPU in its “graphics mode,” accessing the pro- 
cessing power of the GPU through the graphics APIs, OpenGL™, and DirectX™, 
Games are built on the 3D graphics processing capability. 

Beyond 2D and 3D graphics, image processing and video are important applica- 
tions for GPUs. These can be implemented using the graphics APIs or as compu- 
tational programs, using CUDA to program the GPU in computing mode. Using 
CUDA, image processing is simply another data-parallel array program. To the 
extent that the data access is regular and there is good locality, the program will 
be efficient. In practice, image processing is a very good application for GPUs. 
Video processing, especially encode and decode (compression and decompression 
according to some standard algorithms) is quite efficient. 

The greatest opportunity for visual computing applications on GPUs is to “break 
the graphics pipeline.” Early GPUs implemented only specific graphics APIs, albeit 
at very high performance. This was wonderful if the API supported the operations 
that you wanted to do. If not, the GPU could not accelerate your task, because early 
GPU functionality was immutable. Now, with the advent of GPU computing and 
CUDA, these GPUs can be programmed to implement a different virtual pipeline 
by simply writing a CUDA program to describe the computation and data flow 
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that is desired. So, all applications are now possible, which will stimulate new visual 
computing approaches. 


GPU System Architectures 


In this section, we survey GPU system architectures in common use today. We 
discuss system configurations, GPU functions and services, standard programming 
interfaces, and a basic GPU internal architecture. 


Heterogeneous CPU-GPU System Architecture 


A heterogeneous computer system architecture using a GPU and a CPU can be 
described at a high level by two primary characteristics: first, how many functional 
subsystems and/or chips are used and what are their interconnection technologies 
and topology; and second, what memory subsystems are available to these functional 
subsystems. See Chapter 6 for background on the PC I/O systems and chip sets. 


The Historical PC (circa 1990) 


Figure A.2.1 is a high-level block diagram of a legacy PC, circa 1990. The north 
bridge (see Chapter 6) contains high-bandwidth interfaces, connecting the CPU, 
memory, and PCI bus. The south bridge contains legacy interfaces and devices: 
ISA bus (audio, LAN), interrupt controller; DMA controller; time/counter. In 
this system, the display was driven by a simple framebuffer subsystem known 
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FIGURE A.2.1 Historical PC. VGA controller drives graphics display from framebuffer memory. 
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PCI-Express (PCIe) 

A standard system I/O 
interconnect that uses 
point-to-point links. 
Links have a configurable 
number of lanes and 
bandwidth. 


as a VGA (video graphics array) which was attached to the PCI bus. Graphics 
subsystems with built-in processing elements (GPUs) did not exist in the PC 
landscape of 1990. 

Figure A.2.2 illustrates two configurations in common use today. These are 
characterized by a separate GPU (discrete GPU) and CPU with respective memory 
subsystems. In Figure A.2.2a, with an Intel CPU, we see the GPU attached via a 
16-lane PCI-Express 2.0 link to provide a peak 16 GB/s transfer rate, (peak of 
8 GB/s in each direction). Similarly, in Figure A.2.2b, with an AMD CPU, the GPU 
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FIGURE A.2.2 Contemporary PCs with Intel and AMD CPUs. See Chapter 6 for an explanation of 
the components and interconnects in this figure. 
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is attached to the chipset, also via PCI-Express with the same available bandwidth. 
In both cases, the GPUs and CPUs may access each other’s memory, albeit with 
less available bandwidth than their access to the more directly attached memories. 
In the case of the AMD system, the north bridge or memory controller is integrated 
into the same die as the CPU. 

A low-cost variation on these systems, a unified memory architecture (UMA) 
system, uses only CPU system memory, omitting GPU memory from the system. 
These systems have relatively low performance GPUs, since their achieved 
performance is limited by the available system memory bandwidth and increased 
latency of memory access, whereas dedicated GPU memory provides high 
bandwidth and low latency. 

A high performance system variation uses multiple attached GPUs, typically 
two to four working in parallel, with their displays daisy-chained. An example is 
the NVIDIA SLI (scalable link interconnect) multi-GPU system, designed for high 
performance gaming and workstations. 

The next system category integrates the GPU with the north bridge (Intel) or 
chipset (AMD) with and without dedicated graphics memory. 

Chapter 5 explains how caches maintain coherence in a shared address space. 
With CPUs and GPUs, there are multiple address spaces. GPUs can access their 
own physical local memory and the CPU system’s physical memory using virtual 
addresses that are translated by an MMU on the GPU. The operating system kernel 
manages the GPU’s page tables. A system physical page can be accessed using either 
coherent or noncoherent PCI-Express transactions, determined by an attribute 
in the GPU’s page table. The CPU can access GPU’s local memory through an 
address range (also called aperture) in the PCI-Express address space. 


Game Consoles 


Console systems such as the Sony PlayStation 3 and the Microsoft Xbox 360 
resemble the PC system architectures previously described. Console systems 
are designed to be shipped with identical performance and functionality over 
a lifespan that can last five years or more. During this time, a system may be 
reimplemented many times to exploit more advanced silicon manufacturing 
processes and thereby to provide constant capability at ever lower costs. Console 
systems do not need to have their subsystems expanded and upgraded the way PC 
systems do, so the major internal system buses tend to be customized rather than 
standardized. 


GPU Interfaces and Drivers 


In a PC today, GPUs are attached to a CPU via PCI-Express. Earlier generations 
used AGP. Graphics applications call OpenGL [Segal and Akeley, 2006] or 
Direct3D [Microsoft DirectX Specification] API functions that use the GPU as 
a coprocessor. The APIs send commands, programs, and data to the GPU via a 
graphics device driver optimized for the particular GPU. 


unified memory 
architecture (UMA) 

A system architecture in 
which the CPU and GPU 
share a common system 
memory. 


AGP An extended 
version of the original PCI 
T/O bus, which provided 
up to eight times the 
bandwidth of the original 
PCI bus to a single card 
slot. Its primary purpose 
was to connect graphics 
subsystems into PC 
systems. 
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Graphics Logical Pipeline 


The graphics logical pipeline is described in Section A.3. Figure A.2.3 illustrates 
the major processing stages, and highlights the important programmable stages 
(vertex, geometry, and pixel shader stages). 


Input 
Assembler 


Vertex Geometry 
Shader _ 


Setup & Pixel Raster Operations/ 
_ Shader Rasterizer Shader Output Merger 


FIGURE A.2.3 Graphics logical pipeline. Programmable graphics shader stages are blue, and fixed-function blocks are white. 


Mapping Graphics Pipeline to Unified GPU Processors 


Figure A.2.4 shows how the logical pipeline comprising separate independent 
programmable stages is mapped onto a physical distributed array of processors. 


Basic Unified GPU Architecture 


Unified GPU architectures are based on a parallel array of many programmable 
processors. They unify vertex, geometry, and pixel shader processing and parallel 
computing on the same processors, unlike earlier GPUs which had separate 
processors dedicated to each processing type. The programmable processor array is 
tightly integrated with fixed function processors for texture filtering, rasterization, 
raster operations, anti-aliasing, compression, decompression, display, video 
decoding, and high-definition video processing. Although the fixed-function 
processors significantly outperform more general programmable processors in 
terms of absolute performance constrained by an area, cost, or power budget, we 
will focus on the programmable processors here. 

Compared with multicore CPUs, manycore GPUs have a different architectural 
design point, one focused on executing many parallel threads efficiently on many 
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FIGURE A.2.4 Logical pipeline mapped to physical processors. The programmable shader 
stages execute on the array of unified processors, and the logical graphics pipeline dataflow recirculates 
through the processors. 
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processor cores. By using many simpler cores and optimizing for data-parallel 
behavior among groups of threads, more of the per-chip transistor budget is 
devoted to computation, and less to on-chip caches and overhead. 


Processor Array 


A unified GPU processor array contains many processor cores, typically organized 
into multithreaded multiprocessors. Figure A.2.5 shows a GPU with an array of 
112 streaming processor (SP) cores, organized as 14 multithreaded streaming 
multiprocessors (SM). Each SP core is highly multithreaded, managing 96 
concurrent threads and their state in hardware. The processors connect with 
four 64-bit-wide DRAM partitions via an interconnection network. Each SM 
has eight SP cores, two special function units (SFUs), instruction and constant 
caches, a multithreaded instruction unit, and a shared memory. This is the basic 
Tesla architecture implemented by the NVIDIA GeForce 8800. It has a unified 
architecture in which the traditional graphics programs for vertex, geometry, and 
pixel shading run on the unified SMs and their SP cores, and computing programs 
run on the same processors. 
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FIGURE A.2.5 Basic unified GPU architecture. Fxample GPU with 112 streaming processor (SP) cores organized in 14 streaming 
multiprocessors (SMs); the cores are highly multithreaded. It has the basic Tesla architecture of an NVIDIA GeForce 8800. The processors 
connect with four 64-bit-wide DRAM partitions via an interconnection network. Each SM has eight SP cores, two special function units 
(SFUs), instruction and constant caches, a multithreaded instruction unit, and a shared memory. 
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The processor array architecture is scalable to smaller and larger GPU configu- 
rations by scaling the number of multiprocessors and the number of memory 
partitions. Figure A.2.5 shows seven clusters of two SMs sharing a texture unit and 
a texture L1 cache. The texture unit delivers filtered results to the SM given a set of 
coordinates into a texture map. Because filter regions of support often overlap for 
successive texture requests, a small streaming L1 texture cache is effective to reduce 
the number of requests to the memory system. The processor array connects with 
raster operation (ROP) processors, L2 texture caches, external DRAM memories, 
and system memory via a GPU-wide interconnection network. The number of 
processors and number of memories can scale to design balanced GPU systems for 
different performance and market segments. 


Programming GPUs 


Programming multiprocessor GPUs is qualitatively different than programming 
other multiprocessors like multicore CPUs. GPUs provide two to three orders 
of magnitude more thread and data parallelism than CPUs, scaling to hundreds 
of processor cores and tens of thousands of concurrent threads in 2008. GPUs 
continue to increase their parallelism, doubling it about every 12 to 18 months, 
enabled by Moore’s law [1965] of increasing integrated circuit density and by 
improving architectural efficiency. To span the wide price and performance range 
of different market segments, different GPU products implement widely varying 
numbers of processors and threads. Yet users expect games, graphics, imaging, 
and computing applications to work on any GPU, regardless of how many parallel 
threads it executes or how many parallel processor cores it has, and they expect 
more expensive GPUs (with more threads and cores) to run applications faster. 
As a result, GPU programming models and application programs are designed to 
scale transparently to a wide range of parallelism. 

The driving force behind the large number of parallel threads and cores in a 
GPU is real-time graphics performance—the need to render complex 3D scenes 
with high resolution at interactive frame rates, at least 60 frames per second. 
Correspondingly, the scalable programming models of graphics shading languages 
such as Cg (C for graphics) and HLSL (high-level shading language) are designed 
to exploit large degrees of parallelism via many independent parallel threads and to 
scale to any number of processor cores. The CUDA scalable parallel programming 
model similarly enables general parallel computing applications to leverage large 
numbers of parallel threads and scale to any number of parallel processor cores, 
transparently to the application. 

In these scalable programming models, the programmer writes code for a single 
thread, and the GPU runs myriad thread instances in parallel. Programs thus scale 
transparently over a wide range of hardware parallelism. This simple paradigm 
arose from graphics APIs and shading languages that describe how to shade one 
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vertex or one pixel. It has remained an effective paradigm as GPUs have rapidly 
increased their parallelism and performance since the late 1990s. 

This section briefly describes programming GPUs for real-time graphics 
applications using graphics APIs and programming languages. It then describes 
programming GPUs for visual computing and general parallel computing 
applications using the C language and the CUDA programming model. 


Programming Real-Time Graphics 


APIs have played an important role in the rapid, successful development of GPUs 
and processors. There are two primary standard graphics APIs: OpenGL and 
Direct3D, one of the Microsoft DirectX multimedia programming interfaces. 
OpenGL, an open standard, was originally proposed and defined by Silicon 
Graphics Incorporated. The ongoing development and extension of the OpenGL 
standard [Segal and Akeley, 2006], [Kessenich, 2006] is managed by Khronos, 
an industry consortium. Direct3D [Blythe, 2006], a de facto standard, is defined 
and evolved forward by Microsoft and partners. OpenGL and Direct3D are 
similarly structured, and continue to evolve rapidly with GPU hardware advances. 
They define a logical graphics processing pipeline that is mapped onto the GPU 
hardware and processors, along with programming models and languages for the 
programmable pipeline stages. 


Logical Graphics Pipeline 


Figure A.3.1 illustrates the Direct3D 10 logical graphics pipeline. OpenGL has a 
similar graphics pipeline structure. The API and logical pipeline provide a streaming 
dataflow infrastructure and plumbing for the programmable shader stages, shown in 
blue. The 3D application sends the GPU a sequence of vertices grouped into geometric 
primitives—points, lines, triangles, and polygons. The input assembler collects 
vertices and primitives. The vertex shader program executes per-vertex processing, 


OpenGL An open- 
standard graphics API. 


Direct3D A graphics API 
defined by Microsoft and 
partners. 


FIGURE A.3.1 Direct3D 10 graphics pipeline. Each logical pipeline stage maps to GPU hardware or to a GPU processor. Programmable 
shader stages are blue, fixed-function blocks are white, and memory objects are grey. Each stage processes a vertex, geometric primitive, or pixel 


in a streaming dataflow fashion. 
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texture A 1D, 2D, or 

3D array that supports 
sampled and filtered 
lookups with interpolated 
coordinates. 


shader A program that 
operates on graphics data 
such as a vertex or a pixel 
fragment. 


shading language 

A graphics rendering 
language, usually having 
a dataflow or streaming 
programming model. 


including transforming the vertex 3D position into a screen position and lighting the 
vertex to determine its color. The geometry shader program executes per-primitive 
processing and can add or drop primitives. The setup and rasterizer unit generates 
pixel fragments (fragments are potential contributions to pixels) that are covered by 
a geometric primitive. The pixel shader program performs per-fragment processing, 
including interpolating per-fragment parameters, texturing, and coloring. Pixel 
shaders make extensive use of sampled and filtered lookups into large 1D, 2D, or 
3D arrays called textures, using interpolated floating-point coordinates. Shaders use 
texture accesses for maps, functions, decals, images, and data. The raster operations 
processing (or output merger) stage performs Z-buffer depth testing and stencil 
testing, which may discard a hidden pixel fragment or replace the pixel’s depth with 
the fragment’s depth, and performs a color blending operation that combines the 
fragment color with the pixel color and writes the pixel with the blended color. 

The graphics API and graphics pipeline provide input, output, memory objects, 
and infrastructure for the shader programs that process each vertex, primitive, and 
pixel fragment. 


Graphics Shader Programs 


Real-time graphics applications use many different shader programs to model 
how light interacts with different materials and to render complex lighting and 
shadows. Shading languages are based on a dataflow or streaming programming 
model that corresponds with the logical graphics pipeline. Vertex shader programs 
map the position of triangle vertices onto the screen, altering their position, color, 
or orientation. Typically a vertex shader thread inputs a floating-point (x, y, z, w) 
vertex position and computes a floating-point (x, y, z) screen position. Geometry 
shader programs operate on geometric primitives (such as lines and triangles) 
defined by multiple vertices, changing them or generating additional primitives. 
Pixel fragment shaders each “shade” one pixel, computing a floating-point red, 
green, blue, alpha (RGBA) color contribution to the rendered image at its pixel 
sample (x, y) image position. Shaders (and GPUs) use floating-point arithmetic 
for all pixel color calculations to eliminate visible artifacts while computing the 
extreme range of pixel contribution values encountered while rendering scenes with 
complex lighting, shadows, and high dynamic range. For all three types of graphics 
shaders, many program instances can be run in parallel, as independent parallel 
threads, because each works on independent data, produces independent results, 
and has no side effects. Independent vertices, primitives, and pixels further enable 
the same graphics program to run on differently sized GPUs that process different 
numbers of vertices, primitives, and pixels in parallel. Graphics programs thus scale 
transparently to GPUs with different amounts of parallelism and performance. 
Users program all three logical graphics threads with a common targeted high- 
level language. HLSL (high-level shading language) and Cg (C for graphics) are 
commonly used. They have C-like syntax and a rich set of library functions for 
matrix operations, trigonometry, interpolation, and texture access and filtering, 
but are far from general computing languages: they currently lack general memory 
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access, pointers, file I/O, and recursion. HLSL and Cg assume that programs live 
within a logical graphics pipeline, and thus I/O is implicit. For example, a pixel 
fragment shader may expect the geometric normal and multiple texture coordinates 
to have been interpolated from vertex values by upstream fixed-function stages and 
can simply assign a value to the COLOR output parameter to pass it downstream 
to be blended with a pixel at an implied (x, y) position. 

The GPU hardware creates a new independent thread to execute a vertex, geometry, 
or pixel shader program for every vertex, every primitive, and every pixel fragment. In 
video games, the bulk of threads execute pixel shader programs, as there are typically 
10 to 20 times or more pixel fragments than vertices, and complex lighting and 
shadows require even larger ratios of pixel to vertex shader threads. The graphics 
shader programming model drove the GPU architecture to efficiently execute 
thousands of independent fine-grained threads on many parallel processor cores. 


Pixel Shader Example 


Consider the following Cg pixel shader program that implements the “environment 
mapping” rendering technique. For each pixel thread, this shader is passed five 
parameters, including 2D floating-point texture image coordinates needed to 
sample the surface color, and a 3D floating-point vector giving the reflection of 
the view direction off the surface. The other three “uniform” parameters do not 
vary from one pixel instance (thread) to the next. The shader looks up color in 
two texture images: a 2D texture access for the surface color, and a 3D texture 
access into a cube map (six images corresponding to the faces of a cube) to obtain 
the external world color corresponding to the reflection direction. Then the final 
four-component (red, green, blue, alpha) floating-point color is computed using a 
weighted average called a “lerp” or linear interpolation function. 


void reflection( 


float2 texCoord : TEXCOORDO, 
float3 reflection_dir : TEXCOORDI, 
out float4 color : COLOR, 
uniform float shiny, 


uniform sampler2D surfaceMap, 
uniform samplerCUBE envMap) 


// Fetch the surface color from a texture 
float4 surfaceColor = tex2D(surfaceMap, texCoord); 


// Fetch reflected color by sampling a cube map 
float4 reflectedColor = texCUBE(environmentMap, reflection_dir); 


// Output is weighted average of the two colors 
color = lerp(surfaceColor, reflectedColor, shiny); 
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Although this shader program is only three lines long, it activates a lot of GPU 
hardware. For each texture fetch, the GPU texture subsystem makes multiple 
memory accesses to sample image colors in the vicinity of the sampling coordinates, 
and then interpolates the final result with floating-point filtering arithmetic. The 
multithreaded GPU executes thousands of these lightweight Cg pixel shader threads 
in parallel, deeply interleaving them to hide texture fetch and memory latency. 

Cg focuses the programmer’s view to a single vertex or primitive or pixel, which 
the GPU implements as a single thread; the shader program transparently scales to 
exploit thread parallelism on the available processors. Being application-specific, Cg 
provides a rich set of useful data types, library functions, and language constructs 
to express diverse rendering techniques. 

Figure A.3.2 shows skin rendered by a fragment pixel shader. Real skin appears 
quite different from flesh-color paint because light bounces around a lot before 
re-emerging. In this complex shader, three separate skin layers, each with unique 
subsurface scattering behavior, are modeled to give the skin a visual depth and 
translucency. Scattering can be modeled by a blurring convolution in a flattened 
“texture” space, with red being blurred more than green, and blue blurred less. 


FIGURE A.3.2 GPU-rendered image. To give the skin visual depth and translucency, the pixel shader 
program models three separate skin layers, each with unique subsurface scattering behavior. It executes 1400 
instructions to render the red, green, blue, and alpha color components of each skin pixel fragment. 
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The compiled Cg shader executes 1400 instructions to compute the color of one 
skin pixel. 

As GPUs have evolved superior floating-point performance and very high 
streaming memory bandwidth for real-time graphics, they have attracted highly 
parallel applications beyond traditional graphics. At first, access to this power 
was available only by couching an application as a graphics-rendering algorithm, 
but this GPGPU approach was often awkward and limiting. More recently, 
the CUDA programming model has provided a far easier way to exploit the 
scalable high-performance floating-point and memory bandwidth of GPUs with 
the C programming language. 


Programming Parallel Computing Applications 


CUDA, Brook, and CAL are programming interfaces for GPUs that are focused on 
data parallel computation rather than on graphics. CAL (Compute Abstraction 
Layer) is a low-level assembler language interface for AMD GPUs. Brook is a 
streaming language adapted for GPUs by Buck, et. al. [2004]. CUDA, developed by 
NVIDIA [2007], is an extension to the C and C++ languages for scalable parallel 
programming of manycore GPUs and multicore CPUs. The CUDA programming 
model is described below, adapted from an article by Nickolls, Buck, Garland, and 
Skadron [2008]. 

With the new model the GPU excels in data parallel and throughput computing, 
executing high performance computing applications as well as graphics applications. 


Data Parallel Problem Decomposition 


To map large computing problems effectively to a highly parallel processing 
architecture, the programmer or compiler decomposes the problem into many 
small problems that can be solved in parallel. For example, the programmer par- 
titions a large result data array into blocks and further partitions each block into 
elements, such that the result blocks can be computed independently in parallel, 
and the elements within each block are computed in parallel. Figure A.3.3 shows 
a decomposition of a result data array into a 3 x 2 grid of blocks, where each 
block is further decomposed into a 5 x 3 array of elements. The two-level parallel 
decomposition maps naturally to the GPU architecture: parallel multiprocessors 
compute result blocks, and parallel threads compute result elements. 

The programmer writes a program that computes a sequence of result data 
grids, partitioning each result grid into coarse-grained result blocks that can be 
computed independently in parallel. The program computes each result block with 
an array of fine-grained parallel threads, partitioning the work among threads so 
that each computes one or more result elements. 


Scalable Parallel Programming with CUDA 


The CUDA scalable parallel programming model extends the C and C++ 
languages to exploit large degrees of parallelism for general applications on highly 
parallel multiprocessors, particularly GPUs. Early experience with CUDA shows 
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Sequence 


Step 1: 


Step 2: 


FIGURE A.3.3 Decomposing result data into a grid of blocks of elements to be computed 
in parallel. 


that many sophisticated programs can be readily expressed with a few easily 
understood abstractions. Since NVIDIA released CUDA in 2007, developers have 
rapidly developed scalable parallel programs for a wide range of applications, 
including seismic data processing, computational chemistry, linear algebra, sparse 
matrix solvers, sorting, searching, physics models, and visual computing. These 
applications scale transparently to hundreds of processor cores and thousands of 
concurrent threads. NVIDIA GPUs with the Tesla unified graphics and computing 
architecture (described in sections A.4 and A.7) run CUDA C programs, and are 
widely available in laptops, PCs, workstations, and servers. The CUDA model is 
also applicable to other shared memory parallel processing architectures, including 
multicore CPUs [Stratton, 2008]. 

CUDA provides three key abstractions—a hierarchy of thread groups, shared 
memories, and barrier synchronization—that provide a clear parallel structure to con- 
ventional C code for one thread of the hierarchy. Multiple levels of threads, memory, 
and synchronization provide fine-grained data parallelism and thread parallelism, 
nested within coarse-grained data parallelism and task parallelism. The abstractions 
guide the programmer to partition the problem into coarse subproblems that can 
be solved independently in parallel, and then into finer pieces that can be solved in 
parallel. The programming model scales transparently to large numbers of proces- 
sor cores: a compiled CUDA program executes on any number of processors, and 
only the runtime system needs to know the physical processor count. 
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The CUDA Paradigm 


CUDA is a minimal extension of the C and C++ programming languages. The 
programmer writes a serial program that calls parallel kernels, which may be simple 
functions or full programs. A kernel executes in parallel across a set of parallel 
threads. The programmer organizes these threads into a hierarchy of thread blocks 
and grids of thread blocks. A thread block is a set of concurrent threads that can 
cooperate among themselves through barrier synchronization and through shared 
access to a memory space private to the block. A grid is a set of thread blocks that 
may each be executed independently and thus may execute in parallel. 

When invoking a kernel, the programmer specifies the number of threads per 
block and the number of blocks comprising the grid. Each thread is given a unique 
thread ID number thread!dx within its thread block, numbered 0,1, 2, ..., 
blockDim-1, and each thread block is given a unique block ID number b1 ockIdx 
within its grid. CUDA supports thread blocks containing up to 512 threads. For 
convenience, thread blocks and grids may have 1, 2, or 3 dimensions, accessed via 
.X, .y, and .z index fields. 

As a very simple example of parallel programming, suppose that we are given 
two vectors x and y of n floating-point numbers each and that we wish to compute 
the result of y= ax + y for some scalar value a. This is the so-called SAXPY kernel 
defined by the BLAS linear algebra library. Figure A.3.4 shows C code for perform- 
ing this computation on both a serial processor and in parallel using CUDA. 

The __global__ declaration specifier indicates that the procedure is a kernel 
entry point. CUDA programs launch parallel kernels with the extended function 
call syntax: 


kernel<<<dimGrid, dimBlock>>>(... parameter list ...); 


where dimGrid and dimB1ock are three-element vectors of type dim3 that specify 
the dimensions of the grid in blocks and the dimensions of the blocks in threads, 
respectively. Unspecified dimensions default to one. 

In Figure A.3.4, we launch a grid of n threads that assigns one thread to each 
element of the vectors and puts 256 threads in each block. Each individual thread 
computes an element index from its thread and block IDs and then performs the 
desired calculation on the corresponding vector elements. Comparing the serial and 
parallel versions of this code, we see that they are strikingly similar. This represents 
a fairly common pattern. The serial code consists of a loop where each iteration is 
independent of all the others. Such loops can be mechanically transformed into 
parallel kernels: each loop iteration becomes an independent thread. By assigning 
a single thread to each output element, we avoid the need for any synchronization 
among threads when writing results to memory. 

The text of a CUDA kernel is simply a C function for one sequential thread. 
Thus, it is generally straightforward to write and is typically simpler than writing 
parallel code for vector operations. Parallelism is determined clearly and explicitly 
by specifying the dimensions of a grid and its thread blocks when launching a 
kernel. 


kernel A program or 
function for one thread, 
designed to be executed 
by many threads. 


thread block A set 

of concurrent threads 
that execute the same 
thread program and may 
cooperate to compute a 
result. 


grid A set of thread 
blocks that execute the 
same kernel program. 
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synchronization barrier 
Threads wait at a synchro- 
nization barrier until 

all threads in the thread 
block arrive at the barrier. 


Computing y = ax + y with a serial loop: 


void saxpy_serial(int n, float alpha, float *x, float *y) 
{ 
for(int i = 0; i<n; ++i) 
yli] = alpha*xli] + yli]; 
} 


// Invoke serial SAXPY kernel 
saxpy_serial(n, 2.0, x, y); 


Computing y = ax + y in parallel using CUDA: 


— global__ 
void saxpy_parallel(int n, float alpha, float *x, float *y) 
{ 


int i = blockIdx.x*blockDim.x + threadIdx.x; 
if( i<n ) y[i] = alpha*x[i] + ylid; 
} 


// Invoke parallel SAXPY kernel (256 threads per block) 
int nblocks = (n + 255) / 256; 
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y); 


FIGURE A.3.4 Sequential code (top) in C versus parallel code (bottom) in CUDA for SAXPY 
(see Chapter 7). CUDA parallel threads replace the C serial loop—each thread computes the same result as 
one loop iteration. The parallel code computes n results with n threads organized in blocks of 256 threads. 


Parallel execution and thread management is automatic. All thread creation, 
scheduling, and termination is handled for the programmer by the underlying sys- 
tem. Indeed, a Tesla architecture GPU performs all thread management directly in 
hardware. The threads of a block execute concurrently and may synchronize at a 
synchronization barrier by calling the __syncthreads() intrinsic. This guar- 
antees that no thread in the block can proceed until all threads in the block have 
reached the barrier. After passing the barrier, these threads are also guaranteed to 
see all writes to memory performed by threads in the block before the barrier. 
Thus, threads in a block may communicate with each other by writing and reading 
per-block shared memory at a synchronization barrier. 

Since threads in a block may share memory and synchronize via barriers, they 
will reside together on the same physical processor or multiprocessor. The number 
of thread blocks can, however, greatly exceed the number of processors. The CUDA 
thread programming model virtualizes the processors and gives the programmer the 
flexibility to parallelize at whatever granularity is most convenient. Virtualization 
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into threads and thread blocks allows intuitive problem decompositions, as the 
number of blocks can be dictated by the size of the data being processed rather 
than by the number of processors in the system. It also allows the same CUDA 
program to scale to widely varying numbers of processor cores. 

To manage this processing element virtualization and provide scalability, CUDA 
requires that thread blocks be able to execute independently. It must be possible to 
execute blocks in any order, in parallel or in series. Different blocks have no means of 
direct communication, although they may coordinate their activities using atomic 
memory operations on the global memory visible to all threads—by atomically 
incrementing queue pointers, for example. This independence requirement allows 
thread blocks to be scheduled in any order across any number of cores, making 
the CUDA model scalable across an arbitrary number of cores as well as across a 
variety of parallel architectures. It also helps to avoid the possibility of deadlock. 
An application may execute multiple grids either independently or dependently. 
Independent grids may execute concurrently, given sufficient hardware resources. 
Dependent grids execute sequentially, with an implicit interkernel barrier between 
them, thus guaranteeing that all blocks of the first grid complete before any block 
of the second, dependent grid begins. 

Threads may access data from multiple memory spaces during their execution. 
Each thread has a private local memory. CUDA uses local memory for thread- 
private variables that do not fit in the thread’s registers, as well as for stack frames 
and register spilling. Each thread block has a shared memory, visible to all threads 
of the block, which has the same lifetime as the block. Finally, all threads have 
access to the same global memory. Programs declare variables in shared and 
global memory with the __shared__ and __device__ type qualifiers. On a 
Tesla architecture GPU, these memory spaces correspond to physically separate 
memories: per-block shared memory is a low-latency on-chip RAM, while global 
memory resides in the fast DRAM on the graphics board. 

Shared memory is expected to be a low-latency memory near each processor, 
much like an L1 cache. It can therefore provide high-performance communication 
and data sharing among the threads of a thread block. Since it has the same lifetime 
as its corresponding thread block, kernel code will typically initialize data in 
shared variables, compute using shared variables, and copy shared memory results 
to global memory. Thread blocks of sequentially dependent grids communicate 
via global memory, using it to read input and write results. 

Figure A.3.5 diagrams the nested levels of threads, thread blocks, and grids of 
thread blocks. It further shows the corresponding levels of memory sharing: local, 
shared, and global memories for per-thread, per-thread-block, and per-application 
data sharing. 

A program manages the global memory space visible to kernels through calls 
to the CUDA runtime, such as cudaMalloc() and cudaFree(). Kernels may 
execute on a physically separate device, as is the case when running kernels on 
the GPU. Consequently, the application must use cudaMemcpy() to copy data 
between the allocated space and the host system memory. 


atomic memory 
operation A memory 
read, modify, write 
operation sequence that 
completes without any 
intervening access. 


local memory Per-thread 
local memory private to 
the thread. 


shared memory Per- 
block memory shared by 
all threads of the block. 


global memory Per- 
application memory 
shared by all threads. 
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single-program multiple 
data (SPMD) A style 

of parallel programming 
model in which all 
threads execute the same 
program. SPMD threads 
typically coordinate with 
barrier synchronization. 


Thread 


Thread Block 


Sequence 


> 
KKKKKKKKK 


FIGUREA.3.5 Nested granularity levels—thread, thread block, and grid—have corresponding 
memory sharing levels—local, shared, and global. Per-thread local memory is private to the thread. 
Per-block shared memory is shared by all threads of the block. Per-application global memory is shared by all 
threads. 


The CUDA programming model is similar in style to the familiar single-program 
multiple data (SPMD) model—it expresses parallelism explicitly, and each kernel 
executes on a fixed number of threads. However, CUDA is more flexible than most 
realizations of SPMD, because each kernel call dynamically creates a new grid with 
the right number of thread blocks and threads for that application step. The pro- 
grammer can use a convenient degree of parallelism for each kernel, rather than 
having to design all phases of the computation to use the same number of threads. 
Figure A.3.6 shows an example of an SPMD-like CUDA code sequence. It first 
instantiates kerne lF ona 2D grid of 3 x 2 blocks where each 2D thread block con- 
sists of 5 X 3 threads. It then instantiates kerne|G ona 1D grid of four 1D thread 
blocks with six threads each. Because kerne |G depends on the results of kernel F, 
they are separated by an interkernel synchronization barrier. 

The concurrent threads of a thread block express fine-grained data paral- 
lelism and thread parallelism. The independent thread blocks of a grid express 
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Sequence kernelF 2D Grid is 3 x 2 thread blocks; each block is 5 x 3 threads 


Block 0,0 | Block1,0 | Block 2,0 


kernel F<<<(3, 2), (5, 3)>>>(params); 


Block 0,1 | Block 1,1 | Block 2, 1 


ode p” Block 1,1 \ \ 


Thread 0, 0 Thread 1, 0 Thread 2, 0 Thread 3, 0 Thread 4, 0 


Thread 0, 1 Thread 1, 1 Thread 2, 1 Thread 3, 1 Thread 4, 1 


Thread 0, 2 Thread 1, 2 Thread 2, 2 Thread 3, 2 Thread 4, 2 


-------------- Interkernel Synchronization Barrier ------------------- 


kernelG 1D Grid is 4 thread blocks, each block is 6 threads 


Block 0 Block 1 Block 2 Block 3 
on as ee a a 
Block 2 


Thread O | Thread1 | Thread2 | Thread3 | Thread4 | Thread 5 


Me Se Ms We es 


FIGURE A.3.6 Sequence of kernel F instantiated on a 2D grid of 2D thread blocks, an interkernel 
synchronization barrier, followed by kernel G on a 1D grid of 1D thread blocks. 


coarse-grained data parallelism. Independent grids express coarse-grained task 
parallelism. A kernel is simply C code for one thread of the hierarchy. 


Restrictions 


For efficiency, and to simplify its implementation, the CUDA programming model 
has some restrictions. Threads and thread blocks may only be created by invoking 
a parallel kernel, not from within a parallel kernel. Together with the required 
independence of thread blocks, this makes it possible to execute CUDA programs 
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with a simple scheduler that introduces minimal runtime overhead. In fact, the 
Tesla GPU architecture implements hardware management and scheduling of 
threads and thread blocks. 

Task parallelism can be expressed at the thread block level but is difficult to 
express within a thread block because thread synchronization barriers operate on 
all the threads of the block. To enable CUDA programs to run on any number of 
processors, dependencies among thread blocks within the same kernel grid are not 
allowed—blocks must execute independently. Since CUDA requires that thread 
blocks be independent and allows blocks to be executed in any order, combining 
results generated by multiple blocks must in general be done by launching a second 
kernel on a new grid of thread blocks (although thread blocks may coordinate their 
activities using atomic memory operations on the global memory visible to all 
threads—by atomically incrementing queue pointers, for example). 

Recursive function calls are not currently allowed in CUDA kernels. Recursion 
is unattractive in a massively parallel kernel, because providing stack space for the 
tens of thousands of threads that may be active would require substantial amounts 
of memory. Serial algorithms that are normally expressed using recursion, such as 
quicksort, are typically best implemented using nested data parallelism rather than 
explicit recursion. 

To support a heterogeneous system architecture combining a CPU and a 
GPU, each with its own memory system, CUDA programs must copy data and 
results between host memory and device memory. The overhead of CPU-GPU 
interaction and data transfers is minimized by using DMA block transfer engines 
and fast interconnects. Compute-intensive problems large enough to need a GPU 
performance boost amortize the overhead better than small problems. 


Implications for Architecture 


The parallel programming models for graphics and computing have driven 
GPU architecture to be different than CPU architecture. The key aspects of GPU 
programs driving GPU processor architecture are: 


m Extensive use of fine-grained data parallelism: Shader programs describe how 
to process a single pixel or vertex, and CUDA programs describe how to 
compute an individual result. 


m Highly threaded programming model: A shader thread program processes a 
single pixel or vertex, and a CUDA thread program may generate a single 
result. A GPU must create and execute millions of such thread programs per 
frame, at 60 frames per second. 


m Scalability: A program must automatically increase its performance when 
provided with additional processors, without recompiling. 


m Intensive floating-point (or integer) computation. 


m Support of high throughput computations. 
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Multithreaded Multiprocessor 
Architecture 


To address different market segments, GPUs implement scalable numbers of 
multiprocessors—in fact, GPUs are multiprocessors composed of multiprocessors. 
Furthermore, each multiprocessor is highly multithreaded to execute many fine- 
grained vertex and pixel shader threads efficiently. A quality basic GPU has two to 
four multiprocessors, while a gaming enthusiast’s GPU or computing platform has 
dozens of them. This section looks at the architecture of one such multithreaded 
multiprocessor, a simplified version of the NVIDIA Tesla streaming multiprocessor 
(SM) described in Section A.7. 

Why use a multiprocessor, rather than several independent processors? The 
parallelism within each multiprocessor provides localized high performance and 
supports extensive multithreading for the fine-grained parallel programming 
models described in Section A.3. The individual threads of a thread block execute 
together within a multiprocessor to share data. The multithreaded multiprocessor 
design we describe here has eight scalar processor cores in a tightly coupled archi- 
tecture, and executes up to 512 threads (the SM described in Section A.7 executes 
up to 768 threads). For area and power efficiency, the multiprocessor shares large 
complex units among the eight processor cores, including the instruction cache, 
the multithreaded instruction unit, and the shared memory RAM. 


Massive Multithreading 
GPU processors are highly multithreaded to achieve several goals: 
m Cover the latency of memory loads and texture fetches from DRAM 
m Support fine-grained parallel graphics shader programming models 
m Support fine-grained parallel computing programming models 
a 


Virtualize the physical processors as threads and thread blocks to provide 
transparent scalability 


m Simplify the parallel programming model to writing a serial program for one 
thread 


Memory and texture fetch latency can require hundreds of processor clocks, 
because GPUs typically have small streaming caches rather than large working-set 
caches like CPUs. A fetch request generally requires a full DRAM access latency 
plus interconnect and buffering latency. Multithreading helps cover the latency 
with useful computing—while one thread is waiting for a load or texture fetch 
to complete, the processor can execute another thread. The fine-grained parallel 
programming models provide literally thousands of independent threads that can 
keep many processors busy despite the long memory latency seen by individual 
threads. 
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A graphics vertex or pixel shader program is a program for a single thread that 
processes a vertex or a pixel. Similarly, a CUDA program is a C program for a 
single thread that computes a result. Graphics and computing programs instantiate 
many parallel threads to render complex images and compute large result arrays. 
To dynamically balance shifting vertex and pixel shader thread workloads, each 
multiprocessor concurrently executes multiple different thread programs and 
different types of shader programs. 

To support the independent vertex, primitive, and pixel programming model of 
graphics shading languages and the single-thread programming model of CUDA 
C/C++, each GPU thread has its own private registers, private per-thread memory, 
program counter, and thread execution state, and can execute an independent code 
path. To efficiently execute hundreds of concurrent lightweight threads, the GPU 
multiprocessor is hardware multithreaded—it manages and executes hundreds 
of concurrent threads in hardware without scheduling overhead. Concurrent 
threads within thread blocks can synchronize at a barrier with a single instruction. 
Lightweight thread creation, zero-overhead thread scheduling, and fast barrier 
synchronization efficiently support very fine-grained parallelism. 


Multiprocessor Architecture 


A unified graphics and computing multiprocessor executes vertex, geometry, and 
pixel fragment shader programs, and parallel computing programs. As Figure A.4.1 
shows, the example multiprocessor consists of eight scalar processor (SP) cores 
each with a large multithreaded register file (RF), two special function units (SFU), 
a multithreaded instruction unit, an instruction cache, a read-only constant cache, 
and a shared memory. 

The 16 KB shared memory holds graphics data buffers and shared computing 
data. CUDA variables declared as___shared__ reside in the shared memory. To 
map the logical graphics pipeline workload through the multiprocessor multiple 
times, as shown in Section A.2, vertex, geometry, and pixel threads have independent 
input and output buffers, and workloads arrive and depart independently of thread 
execution. 

Each SP core contains scalar integer and floating-point arithmetic units that 
execute most instructions. The SP is hardware multithreaded, supporting up to 
64 threads. Each pipelined SP core executes one scalar instruction per thread per 
clock, which ranges from 1.2 GHz to 1.6 GHz in different GPU products. Each 
SP core has a large register file (RF) of 1024 general-purpose 32-bit registers, 
partitioned among its assigned threads. Programs declare their register demand, 
typically 16 to 64 scalar 32-bit registers per thread. The SP can concurrently run 
many threads that use a few registers or fewer threads that use more registers. The 
compiler optimizes register allocation to balance the cost of spilling registers versus 
the cost of fewer threads. Pixel shader programs often use 16 or fewer registers, 
enabling each SP to run up to 64 pixel shader threads to cover long-latency texture 
fetches. Compiled CUDA programs often need 32 registers per thread, limiting 
each SP to 32 threads, which limits such a kernel program to 256 threads per thread 
block on this example multiprocessor, rather than its maximum of 512 threads. 
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Multithreaded Multiprocessor 


Instruction Cache 


Multiprocessor 


Multithreaded Instruction Unit Controller 


Constant Cache Work Interface 


Input 
Interface 


| Output 
Interface 


Texture 
Interface 


Memory 
Interface 


FIGURE A.4.1 Multithreaded multiprocessor with eight scalar processor (SP) cores. The 
eight SP cores each have a large multithreaded register file (RF) and share an instruction cache, multithreaded 
instruction issue unit, constant cache, two special function units (SFUs), interconnection network, and a 
multibank shared memory. 


The pipelined SFUs execute thread instructions that compute special functions 
and interpolate pixel attributes from primitive vertex attributes. These instructions 
can execute concurrently with instructions on the SPs. The SFU is described later. 

The multiprocessor executes texture fetch instructions on the texture unit via the 
texture interface, and uses the memory interface for external memory load, store, 
and atomic access instructions. These instructions can execute concurrently with 
instructions on the SPs. Shared memory access uses a low-latency interconnection 
network between the SP processors and the shared memory banks. 


Single-Instruction Multiple-Thread (SIMT) 


To manage and execute hundreds of threads running several different programs 
efficiently, the multiprocessor employs a single-instruction multiple-thread 
(SIMT) architecture. It creates, manages, schedules, and executes concurrent 
threads in groups of parallel threads called warps. The term warp originates from 
weaving, the first parallel thread technology. The photograph in Figure A.4.2 shows 
a warp of parallel threads emerging from a loom. This example multiprocessor 
uses a SIMT warp size of 32 threads, executing four threads in each of the eight 


single-instruction 
multiple-thread (SIMT) 
A processor architecture 
that applies one 
instruction to multiple 
independent threads in 
parallel. 


warp The set of parallel 
threads that execute the 
same instruction together 
in a SIMT architecture. 
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FIGURE A.4.2 SIMT multithreaded warp scheduling. The scheduler selects a ready warp and issues 
an instruction synchronously to the parallel threads composing the warp. Because warps are independent, 
the scheduler may select a different warp each time. 


SP cores over four clocks. The Tesla SM multiprocessor described in Section A.7 
also uses a warp size of 32 parallel threads, executing four threads per SP core for 
efficiency on plentiful pixel threads and computing threads. Thread blocks consist 
of one or more warps. 

This example SIMT multiprocessor manages a pool of 16 warps, a total of 512 
threads. Individual parallel threads composing a warp are the same type and start 
together at the same program address, but are otherwise free to branch and execute 
independently. At each instruction issue time, the SIMT multithreaded instruction 
unit selects a warp that is ready to execute its next instruction, then issues that 
instruction to the active threads of that warp. A SIMT instruction is broadcast 
synchronously to the active parallel threads of a warp; individual threads may be 
inactive due to independent branching or predication. In this multiprocessor, each 
SP scalar processor core executes an instruction for four individual threads of a 
warp using four clocks, reflecting the 4:1 ratio of warp threads to cores. 

SIMT processor architecture is akin to single-instruction multiple data (SIMD) 
design, which applies one instruction to multiple data lanes, but differs in that 
SIMT applies one instruction to multiple independent threads in parallel, not just 
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to multiple data lanes. An instruction for a SIMD processor controls a vector of 
multiple data lanes together, whereas an instruction for a SIMT processor controls 
an individual thread, and the SIMT instruction unit issues an instruction to a warp 
of independent parallel threads for efficiency. The SIMT processor finds data-level 
parallelism among threads at runtime, analogous to the way a superscalar processor 
finds instruction-level parallelism among instructions at runtime. 

A SIMT processor realizes full efficiency and performance when all threads 
of a warp take the same execution path. If threads of a warp diverge via a data- 
dependent conditional branch, execution serializes for each branch path taken, and 
when all paths complete, the threads converge to the same execution path. For equal 
length paths, a divergent if-else code block is 50% efficient. The multiprocessor uses 
a branch synchronization stack to manage independent threads that diverge and 
converge. Different warps execute independently at full speed regardless of whether 
they are executing common or disjoint code paths. As a result, SIMT GPUs are 
dramatically more efficient and flexible on branching code than earlier GPUs, as 
their warps are much narrower than the SIMD width of prior GPUs. 

In contrast with SIMD vector architectures, SIMT enables programmers to write 
thread-level parallel code for individual independent threads, as well as data-parallel 
code for many coordinated threads. For program correctness, the programmer can 
essentially ignore the SIMT execution attributes of warps; however, substantial 
performance improvements can be realized by taking care that the code seldom 
requires threads in a warp to diverge. In practice, this is analogous to the role of 
cache lines in traditional codes: cache line size can be safely ignored when designing 
for correctness but must be considered in the code structure when designing for 
peak performance. 


SIMT Warp Execution and Divergence 


The SIMT approach of scheduling independent warps is more flexible than the 
scheduling of previous GPU architectures. A warp comprises parallel threads 
of the same type: vertex, geometry, pixel, or compute. The basic unit of pixel 
fragment shader processing is the 2-by-2 pixel quad implemented as four pixel 
shader threads. The multiprocessor controller packs the pixel quads into a warp. It 
similarly groups vertices and primitives into warps, and packs computing threads 
into a warp. A thread block comprises one or more warps. The SIMT design shares 
the instruction fetch and issue unit efficiently across parallel threads of a warp, but 
requires a full warp of active threads to get full performance efficiency. 

This unified multiprocessor schedules and executes multiple warp types 
concurrently, allowing it to concurrently execute vertex and pixel warps. Its warp 
scheduler operates at less than the processor clock rate, because there are four thread 
lanes per processor core. During each scheduling cycle, it selects a warp to execute 
a SIMT warp instruction, as shown in Figure A.4.2. An issued warp-instruction 
executes as four sets of eight threads over four processor cycles of throughput. The 
processor pipeline uses several clocks of latency to complete each instruction. If the 
number of active warps times the clocks per warp exceeds the pipeline latency, the 
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cooperative thread 
array (CTA) A set 

of concurrent threads 
that executes the same 
thread program and may 
cooperate to compute 

a result. A GPU CTA 
implements a CUDA 
thread block. 


programmer can ignore the pipeline latency. For this multiprocessor, a round-robin 
schedule of eight warps has a period of 32 cycles between successive instructions 
for the same warp. If the program can keep 256 threads active per multiprocessor, 
instruction latencies up to 32 cycles can be hidden from an individual sequential 
thread. However, with few active warps, the processor pipeline depth becomes 
visible and may cause processors to stall. 

A challenging design problem is implementing zero-overhead warp scheduling 
for a dynamic mix of different warp programs and program types. The instruction 
scheduler must select a warp every four clocks to issue one instruction per clock 
per thread, equivalent to an IPC of 1.0 per processor core. Because warps are 
independent, the only dependencies are among sequential instructions from the 
same warp. The scheduler uses a register dependency scoreboard to qualify warps 
whose active threads are ready to execute an instruction. It prioritizes all such ready 
warps and selects the highest priority one for issue. Prioritization must consider 
warp type, instruction type, and the desire to be fair to all active warps. 


Managing Threads and Thread Blocks 


The multiprocessor controller and instruction unit manage threads and thread 
blocks. The controller accepts work requests and input data and arbitrates access 
to shared resources, including the texture unit, memory access path, and I/O paths. 
For graphics workloads, it creates and manages three types of graphics threads 
concurrently: vertex, geometry, and pixel. Each of the graphics work types have 
independent input and output paths. It accumulates and packs each of these input 
work types into SIMT warps of parallel threads executing the same thread program. 
It allocates a free warp, allocates registers for the warp threads, and starts warp 
execution in the multiprocessor. Every program declares its per-thread register 
demand; the controller starts a warp only when it can allocate the requested register 
count for the warp threads. When all the threads of the warp exit, the controller 
unpacks the results and frees the warp registers and resources. 

The controller creates cooperative thread arrays (CTAs) which implement 
CUDA thread blocks as one or more warps of parallel threads. It creates a CTA 
when it can create all CTA warps and allocate all CTA resources. In addition to 
threads and registers, a CTA requires allocating shared memory and barriers. The 
program declares the required capacities, and the controller waits until it can 
allocate those amounts before launching the CTA. Then it creates CTA warps at the 
warp scheduling rate, so that a CTA program starts executing immediately at full 
multiprocessor performance. The controller monitors when all threads of a CTA 
have exited, and frees the CTA shared resources and its warp resources. 


Thread Instructions 


The SP thread processors execute scalar instructions for individual threads, unlike 
earlier GPU vector instruction architectures, which executed four-component 
vector instructions for each vertex or pixel shader program. Vertex programs 
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generally compute (x, y, z, w) position vectors, while pixel shader programs 
compute (red, green, blue, alpha) color vectors. However, shader programs are 
becoming longer and more scalar, and it is increasingly difficult to fully occupy 
even two components of a legacy GPU four-component vector architecture. In 
effect, the SIMT architecture parallelizes across 32 independent pixel threads, 
rather than parallelizing the four vector components within a pixel. CUDA C/C++ 
programs have predominantly scalar code per thread. Previous GPUs employed 
vector packing (e.g., combining subvectors of work to gain efficiency) but that 
complicated the scheduling hardware as well as the compiler. Scalar instructions 
are simpler and compiler friendly. Texture instructions remain vector based, taking 
a source coordinate vector and returning a filtered color vector. 

To support multiple GPUs with different binary microinstruction formats, 
high-level graphics and computing language compilers generate intermediate 
assembler-level instructions (e.g., Direct3D vector instructions or PTX scalar 
instructions), which are then optimized and translated to binary GPU microin- 
structions. The NVIDIA PTX (parallel thread execution) instruction set definition 
[2007] provides a stable target ISA for compilers, and provides compatibility over 
several generations of GPUs with evolving binary microinstruction-set architec- 
tures. The optimizer readily expands Direct3D vector instructions to multiple sca- 
lar binary microinstructions. PTX scalar instructions translate nearly one to one 
with scalar binary microinstructions, although some PTX instructions expand to 
multiple binary microinstructions, and multiple PTX instructions may fold into 
one binary microinstruction. Because the intermediate assembler-level instruc- 
tions use virtual registers, the optimizer analyzes data dependencies and allocates 
real registers. The optimizer eliminates dead code, folds instructions together when 
feasible, and optimizes SIMT branch diverge and converge points. 


Instruction Set Architecture (ISA) 


The thread ISA described here is a simplified version of the Tesla architecture PTX 
ISA, a register-based scalar instruction set comprising floating-point, integer, logical, 
conversion, special functions, flow control, memory access, and texture operations. 
Figure A.4.3 lists the basic PTX GPU thread instructions; see the NVIDIA PTX 
specification [2007] for details. The instruction format is: 


opcode.type d, a, b, C; 


where d is the destination operand, a, b, c are source operands, and . type is one of: 


Untyped bits 8, 16, 32, and 64 bits .b8, .b16, .b32, .b64 
Unsigned integer 8, 16, 32, and 64 bits .u8, .u16, .u32, .u64 
Signed integer 8, 16, 32, and 64 bits .S8, .S16, .S32, .s64 
Floating-point 16, 32, and 64 bits .f16, .f32, .f64 
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Basic PTX GPU Thread Instructions 


a a 


arithmetic .type = .Ss32, .u32, 
add.type add.f32 d, a, b d=atb 
sub.type sub.f32 d, a, b d=a - b; 
mul.type mul.f32 d, a, b d =a * b; 
mad. type mad.f32 d, a, b, c d=a* bte; multiply-add 
div.type div.f32 d, a, b d=a/b; multiple microinstructions 
rem.type rem.u32 d, a, b d=aZb integer remainder 
hence abs.type abs. iu d, a d= jal; 
neg.type neg.f32 d, a d=0- a; 
min.type min.f32 d, a, b d = (a < b)? a:b; floating selects non-NaN 
max.type max.f32 d, a, b d = (a > b)? a:b; floating selects non-NaN 
setp.cmp.type setp.1t.f32 p, a, b p= €a< pis compare and set predicate 
numeric .cmp=eq, ne, It, le, gt, ge; unordered cmp= equ, neu, Itu, leu, gtu, geu, num, nan 
mov.type mov.b32 d, a d=a; move 
selp.type selp.f32 d, a, b, p d= p? a: b; select with predicate 
cvt.dtype.atype cvt.f32.532 d, a d = convert(a); convert atype to dtype 
special .type = .f32 (some . f64) 
rcp.type rcep.f32 d, a d = 1/a; reciprocal 
sqrt.type sqrt.f32 d, a d = sqrt(a); square root 
Special rsqrt.type rsqrt.f32 d, a d = 1/sqrt(a); reciprocal square root 
Function | sin.type sin.f32 d, a d = sin(a); sine 
cos.type cos.f32 d, a d = cos(a); cosine 
Ig2.type lg2.f32 d, a d = log(a)/log(2) binary logarithm 
ex2.type ex2.f32 d, a d = 2 ** a; binary exponential 
logic.type = .pred,.b32, .b64 
and.type and.b32 d, a, b d=a & b; 
or.type or.b32 d, a, b d= a- | b; 
, xor.type xor.b32 d, a, b d=a ^b; 
Logical 
not. type not.b32 d, a, b d = -~a; one’s complement 
cnot.type cnot.b32 d, a, b d = (a==0)? 1:0; C logical not 
shl.type shl.b32 d, a, b d=a << b; shift left 
shr.type shr.s32 d, @, b d =a >> b; shift right 
memory .space = .global, .shared, .local, .const; .type=.b8, .u8, .s8, .b16, .b32, .b64 
Id.space.type 1d.global.b32 d, [a+off] d = *(atoff); load from memory space 
Miery st.space.type st.shared.b32 [d+off], a *(dtoff) = a; store to memory space 
Ackees tex.nd.dtyp.btype tex.2d.v4.732.f32 d, a, b d = tex2d(a, b); texture lookup 
atom.spc.op.type |atom.global.add.u32 d,[a], b atomic { d = *a; atomic read-modify-write 
atom.global.cas.b32 d,[a], b, c *a = op(*a, b); } | operation 
atom .op=and, or, xor, add, min, max, exch, cas; .spc= .global; .type = .b32 
branch @p bra target if (p) goto conditional branch 
target; 
Control call call (ret), func, (params) ret = func(params); |call function 
Flow ret ret return; return from function call 
bar.sync bar.sync d wait for threads barrier synchronization 
exit exit exit; terminate thread execution 


FIGURE A.4.3 Basic PTX GPU thread instructions. 
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Source operands are scalar 32-bit or 64-bit values in registers, an immediate 
value, or a constant; predicate operands are 1-bit Boolean values. Destinations are 
registers, except for store to memory. Instructions are predicated by prefixing them 
with @p or @! p, where p is a predicate register. Memory and texture instructions 
transfer scalars or vectors of two to four components, up to 128 bits total. PTX 
instructions specify the behavior of one thread. 

The PTX arithmetic instructions operate on 32-bit and 64-bit floating-point, 
signed integer, and unsigned integer types. Recent GPUs support 64-bit double 
precision floating-point; see Section A.6. On current GPUs, PTX 64-bit integer 
and logical instructions are translated to two or more binary microinstructions 
that perform 32-bit operations. The GPU special function instructions are limited 
to 32-bit floating-point. The thread control flow instructions are conditional 
branch, function call and return, thread exit, and bar.sync (barrier 
synchronization). The conditional branch instruction @p bra target uses a 
predicate register p (or !p) previously set by a compare and set predicate setp 
instruction to determine whether the thread takes the branch or not. Other 
instructions can also be predicated on a predicate register being true or false. 


Memory Access Instructions 


The tex instruction fetches and filters texture samples from 1D, 2D, and 3D 
texture arrays in memory via the texture subsystem. Texture fetches generally use 
interpolated floating-point coordinates to address a texture. Once a graphics pixel 
shader thread computes its pixel fragment color, the raster operations processor 
blends it with the pixel color at its assigned (x, y) pixel position and writes the final 
color to memory. 

To support computing and C/C++ language needs, the Tesla PTX ISA implements 
memory load/store instructions. They use integer byte addressing with register plus 
offset address arithmetic to facilitate conventional compiler code optimizations. 
Memory load/store instructions are common in processors, but are a significant 
new capability in the Tesla architecture GPUs, as prior GPUs provided only the 
texture and pixel accesses required by the graphics APIs. 

For computing, the load/store instructions access three read/write memory 
spaces that implement the corresponding CUDA memory spaces in Section A.3: 


m Local memory for per-thread private addressable temporary data (imple- 
mented in external DRAM) 


E Shared memory for low-latency access to data shared by cooperating threads 
in the same CTA/thread block (implemented in on-chip SRAM) 


m Global memory for large data sets shared by all threads of a computing 
application (implemented in external DRAM) 


The memory load/store instructions 1d.global, st.global, ld.shared, 
st.shared, 1d.local,and st.1local access the global, shared, and local mem- 
ory spaces. Computing programs use the fast barrier synchronization instruction 
bar.sync to synchronize threads within a CTA/thread block that communicate 
with each other via shared and global memory. 
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To improve memory bandwidth and reduce overhead, the local and global load/ 
store instructions coalesce individual parallel thread requests from the same SIMT 
warp together into a single memory block request when the addresses fall in the 
same block and meet alignment criteria. Coalescing memory requests provides a 
significant performance boost over separate requests from individual threads. The 
multiprocessor’s large thread count, together with support for many outstanding 
load requests, helps cover load-to-use latency for local and global memory imple- 
mented in external DRAM. 

The latest Tesla architecture GPUs also provide efficient atomic memory opera- 
tions on memory with the atom. op. u32 instructions, including integer operations 
add, min, max, and, or, xor, exchange, and cas (compare-and-swap) opera- 
tions, facilitating parallel reductions and parallel data structure management. 


Barrier Synchronization for Thread Communication 


Fast barrier synchronization permits CUDA programs to communicate frequently 
via shared memory and global memory by simply calling _syncthreads(); as 
part of each interthread communication step. The synchronization intrinsic func- 
tion generates a single bar. sync instruction. However, implementing fast barrier 
synchronization among up to 512 threads per CUDA thread block is a challenge. 
Grouping threads into SIMT warps of 32 threads reduces the synchronization 
difficulty by a factor of 32. Threads wait at a barrier in the SIMT thread scheduler 
so they do not consume any processor cycles while waiting. When a thread executes 
a bar.sync instruction, it increments the barrier’s thread arrival counter and the 
scheduler marks the thread as waiting at the barrier. Once all the CTA threads 
arrive, the barrier counter matches the expected terminal count, and the scheduler 
releases all the threads waiting at the barrier and resumes executing threads. 


Streaming Processor (SP) 


The multithreaded streaming processor (SP) core is the primary thread instruction 
processor in the multiprocessor. Its register file (RF) provides 1024 scalar 
32-bit registers for up to 64 threads. It executes all the fundamental floating-point 
operations, including add.f32, mul.f32, mad.f32 (floating multiply-add), 
min. f32, max.f32, and setp.f32 (floating compare and set predicate). The 
floating-point add and multiply operations are compatible with the IEEE 754 
standard for single precision FP numbers, including not-a-number (NaN) and 
infinity values. The SP core also implements all of the 32-bit and 64-bit integer 
arithmetic, comparison, conversion, and logical PTX instructions in Figure A.4.3. 

The floating-point add and mu1 operations employ IEEE round-to-nearest-even 
as the default rounding mode. The mad . f 32 floating-point multiply-add operation 
performs a multiplication with truncation, followed by an addition with round- 
to-nearest-even. The SP flushes input denormal operands to sign-preserved-zero. 
Results that underflow the target output exponent range are flushed to sign- 
preserved-zero after rounding. 
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Special Function Unit (SFU) 


Certain thread instructions can execute on the SFUs, concurrently with other 
thread instructions executing on the SPs. The SFU implements the special function 
instructions of Figure A.4.3, which compute 32-bit floating-point approximations 
to reciprocal, reciprocal square root, and key transcendental functions. It also 
implements 32-bit floating-point planar attribute interpolation for pixel shaders, 
providing accurate interpolation of attributes such as color, depth, and texture 
coordinates. 

Each pipelined SFU generates one 32-bit floating-point special function result 
per cycle; the two SFUs per multiprocessor execute special function instructions 
at a quarter the simple instruction rate of the eight SPs. The SFUs also execute the 
mul .f32 multiply instruction concurrently with the eight SPs, increasing the peak 
computation rate up to 50% for threads with a suitable instruction mixture. 

For functional evaluation, the Tesla architecture SFU employs quadratic 
interpolation based on enhanced minimax approximations for approximating the 
reciprocal, reciprocal square-root, log,x, 2*, and sin/cos functions. The accuracy of 
the function estimates ranges from 22 to 24 mantissa bits. See Section A.6 for more 
details on SFU arithmetic. 


Comparing with Other Multiprocessors 


Compared with SIMD vector architectures such as x86 SSE, the SIMT multipro- 
cessor can execute individual threads independently, rather than always executing 
them together in synchronous groups. SIMT hardware finds data parallelism among 
independent threads, whereas SIMD hardware requires the software to express data 
parallelism explicitly in each vector instruction. A SIMT machine executes a warp 
of 32 threads synchronously when the threads take the same execution path, yet 
can execute each thread independently when they diverge. The advantage is signi- 
ficant because SIMT programs and instructions simply describe the behavior of a 
single independent thread, rather than a SIMD data vector of four or more data 
lanes. Yet the SIMT multiprocessor has SIMD-like efficiency, spreading the area 
and cost of one instruction unit across the 32 threads of a warp and across the eight 
streaming processor cores. SIMT provides the performance of SIMD together with 
the productivity of multithreading, avoiding the need to explicitly code SIMD vec- 
tors for edge conditions and partial divergence. 

The SIMT multiprocessor imposes little overhead because it is hardware 
multithreaded with hardware barrier synchronization. That allows graphics 
shaders and CUDA threads to express very fine-grained parallelism. Graphics and 
CUDA programs use threads to express fine-grained data parallelism in a per- 
thread program, rather than forcing the programmer to express it as SIMD vector 
instructions. It is simpler and more productive to develop scalar single-thread code 
than vector code, and the SIMT multiprocessor executes the code with SIMD-like 
efficiency. 
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Coupling eight streaming processor cores together closely into a multiprocessor 
and then implementing a scalable number of such multiprocessors makes a two- 
level multiprocessor composed of multiprocessors. The CUDA programming model 
exploits the two-level hierarchy by providing individual threads for fine-grained 
parallel computations, and by providing grids of thread blocks for coarse-grained 
parallel operations. The same thread program can provide both fine-grained and 
coarse-grained operations. In contrast, CPUs with SIMD vector instructions must 
use two different programming models to provide fine-grained and coarse-grained 
operations: coarse-grained parallel threads on different cores, and SIMD vector 
instructions for fine-grained data parallelism. 


Multithreaded Multiprocessor Conclusion 


The example GPU multiprocessor based on the Tesla architecture is highly 
multithreaded, executing a total of up to 512 lightweight threads concurrently to 
support fine-grained pixel shaders and CUDA threads. It uses a variation on SIMD 
architecture and multithreading called SIMT (single-instruction multiple-thread) 
to efficiently broadcast one instruction to a warp of 32 parallel threads, while 
permitting each thread to branch and execute independently. Each thread executes 
its instruction stream on one of the eight streaming processor (SP) cores, which are 
multithreaded up to 64 threads. 

The PTX ISA is a register-based load/store scalar ISA that describes the execution 
of a single thread. Because PTX instructions are optimized and translated to binary 
microinstructions for a specific GPU, the hardware instructions can evolve rapidly 
without disrupting compilers and software tools that generate PTX instructions. 


Parallel Memory System 


Outside of the GPU itself, the memory subsystem is the most important determiner 
of the performance of a graphics system. Graphics workloads demand very high 
transfer rates to and from memory. Pixel write and blend (read-modify-write) 
operations, depth buffer reads and writes, and texture map reads, as well as 
command and object vertex and attribute data reads, comprise the majority of 
memory traffic. 

Modern GPUs are highly parallel, as shown in Figure A.2.5. For example, the 
GeForce 8800 can process 32 pixels per clock, at 600 MHz. Each pixel typically 
requires a color read and write and a depth read and write of a 4-byte pixel. Usually 
an average of two or three texels of four bytes each are read to generate the pixel’s 
color. So for a typical case, there is a demand of 28 bytes times 32 pixels = 896 bytes 
per clock. Clearly the bandwidth demand on the memory system is enormous. 
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To supply these requirements, GPU memory systems have the following 
characteristics: 


m They are wide, meaning there are a large number of pins to convey data 
between the GPU and its memory devices, and the memory array itself 
comprises many DRAM chips to provide the full total data bus width. 


m They are fast, meaning aggressive signaling techniques are used to maximize 
the data rate (bits/second) per pin. 


m GPUs seek to use every available cycle to transfer data to or from the memory 
array. To achieve this, GPUs specifically do not aim to minimize latency 
to the memory system. High throughput (utilization efficiency) and short 
latency are fundamentally in conflict. 


m Compression techniques are used, both lossy, of which the programmer 
must be aware, and lossless, which is invisible to the application and 
opportunistic. 


m Caches and work coalescing structures are used to reduce the amount of 
off-chip traffic needed and to ensure that cycles spent moving data are used 
as fully as possible. 


DRAM Considerations 


GPUs must take into account the unique characteristics of DRAM. DRAM chips 
are internally arranged as multiple (typically four to eight) banks, where each bank 
includes a power-of-2 number of rows (typically around 16,384), and each row 
contains a power-of-2 number of bits (typically 8192). DRAMs impose a variety of 
timing requirements on their controlling processor. For example, dozens of cycles 
are required to activate one row, but once activated, the bits within that row are 
randomly accessible with a new column address every four clocks. Double-data 
rate (DDR) synchronous DRAMs transfer data on both rising and falling edges 
of the interface clock (see Chapter 5). So a 1 GHz clocked DDR DRAM transfers 
data at 2 gigabits per second per data pin. Graphics DDR DRAMs usually have 
32 bidirectional data pins, so eight bytes can be read or written from the DRAM 
per clock. 

GPUs internally have a large number of generators of memory traffic. 
Different stages of the logical graphics pipeline each have their own request 
streams: command and vertex attribute fetch, shader texture fetch and load/ 
store, and pixel depth and color read-write. At each logical stage, there are often 
multiple independent units to deliver the parallel throughput. These are each 
independent memory requestors. When viewed at the memory system, there 
are an enormous number of uncorrelated requests in flight. This is a natural 
mismatch to the reference pattern preferred by the DRAMs. A solution is for 
the GPU’s memory controller to maintain separate heaps of traffic bound for 
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different DRAM banks, and wait until enough traffic for a particular DRAM 
row is pending before activating that row and transferring all the traffic at once. 
Note that accumulating pending requests, while good for DRAM row locality 
and thus efficient use of the data bus, leads to longer average latency as seen by 
the requestors whose requests spend time waiting for others. The design must 
take care that no particular request waits too long, otherwise some processing 
units can starve waiting for data and ultimately cause neighboring processors to 
become idle. 

GPU memory subsystems are arranged as multiple memory partitions, each of 
which comprises a fully independent memory controller and one or two DRAM 
devices that are fully and exclusively owned by that partition. To achieve the best 
load balance and therefore approach the theoretical performance of n partitions, 
addresses are finely interleaved evenly across all memory partitions. The partition 
interleaving stride is typically a block of a few hundred bytes. The number of 
memory partitions is designed to balance the number of processors and other 
memory requesters. 


Caches 


GPU workloads typically have very large working sets—on the order of hundreds 
of megabytes to generate a single graphics frame. Unlike with CPUs, it is not 
practical to construct caches on chips large enough to hold anything close to the 
full working set of a graphics application. Whereas CPUs can assume very high 
cache hit rates (99.9% or more), GPUs experience hit rates closer to 90% and must 
therefore cope with many misses in flight. While a CPU can reasonably be designed 
to halt while waiting for a rare cache miss, a GPU needs to proceed with misses and 
hits intermingled. We call this a streaming cache architecture. 

GPU caches must deliver very high-bandwidth to their clients. Consider the case 
of a texture cache. A typical texture unit may evaluate two bilinear interpolations 
for each of four pixels per clock cycle, and a GPU may have many such texture 
units all operating independently. Each bilinear interpolation requires four 
separate texels, and each texel might be a 64-bit value. Four 16-bit components 
are typical. Thus, total bandwidth is 2 x 4 x 4 x 64 = 2048 bits per clock. Each 
separate 64-bit texel is independently addressed, so the cache needs to handle 
32 unique addresses per clock. This naturally favors a multibank and/or multiport 
arrangement of SRAM arrays. 


MMU 


Modern GPUs are capable of translating virtual addresses to physical addresses. 
On the GeForce 8800, all processing units generate memory addresses in a 
40-bit virtual address space. For computing, load and store thread instructions use 
32-bit byte addresses, which are extended to a 40-bit virtual address by adding a 
40-bit offset. A memory management unit performs virtual to physical address 
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translation; hardware reads the page tables from local memory to respond to misses 
on behalf of a hierarchy of translation lookaside buffers spread out among the 
processors and rendering engines. In addition to physical page bits, GPU page table 
entries specify the compression algorithm for each page. Page sizes range from 4 to 
128 kilobytes. 


Memory Spaces 


As introduced in Section A.3, CUDA exposes different memory spaces to allow the 
programmer to store data values in the most performance-optimal way. For the 
following discussion, NVIDIA Tesla architecture GPUs are assumed. 


Global memory 


Global memory is stored in external DRAM; it is not local to any one physical 
streaming multiprocessor (SM) because it is meant for communication among 
different CTAs (thread blocks) in different grids. In fact, the many CTAs that 
reference a location in global memory may not be executing in the GPU at the 
same time; by design, in CUDA a programmer does not know the relative order 
in which CTAs are executed. Because the address space is evenly distributed 
among all memory partitions, there must be a read/write path from any streaming 
multiprocessor to any DRAM partition. 

Access to global memory by different threads (and different processors) is not 
guaranteed to have sequential consistency. Thread programs see a relaxed memory 
ordering model. Within a thread, the order of memory reads and writes to the same 
address is preserved, but the order of accesses to different addresses may not be 
preserved. Memory reads and writes requested by different threads are unordered. 
Within a CTA, the barrier synchronization instruction bar.sync can be used 
to obtain strict memory ordering among the threads of the CTA. The membar 
thread instruction provides a memory barrier/fence operation that commits prior 
memory accesses and makes them visible to other threads before proceeding. 
Threads can also use the atomic memory operations described in Section A.4 to 
coordinate work on memory they share. 


Shared memory 


Per-CTA shared memory is only visible to the threads that belong to that CTA, 
and shared memory only occupies storage from the time a CTA is created to the 
time it terminates. Shared memory can therefore reside on-chip. This approach has 
many benefits. First, shared memory traffic does not need to compete with limited 
off-chip bandwidth needed for global memory references. Second, it is practical to 
build very high-bandwidth memory structures on-chip to support the read/write 
demands of each streaming multiprocessor. In fact, the shared memory is closely 
coupled to the streaming multiprocessor. 
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Each streaming multiprocessor contains eight physical thread processors. 
During one shared memory clock cycle, each thread processor can process two 
threads’ worth of instructions, so 16 threads’ worth of shared memory requests 
must be handled in each clock. Because each thread can generate its own addresses, 
and the addresses are typically unique, the shared memory is built using 16 
independently addressable SRAM banks. For common access patterns, 16 banks are 
sufficient to maintain throughput, but pathological cases are possible; for example, 
all 16 threads might happen to access a different address on one SRAM bank. It 
must be possible to route a request from any thread lane to any bank of SRAM, so 
a 16-by-16 interconnection network is required. 


Local Memory 


Per-thread local memory is private memory visible only to a single thread. Local 
memory is architecturally larger than the thread’s register file, and a program 
can compute addresses into local memory. To support large allocations of local 
memory (recall the total allocation is the per-thread allocation times the number 
of active threads), local memory is allocated in external DRAM. 

Although global and per-thread local memory reside off-chip, they are well- 
suited to being cached on-chip. 


Constant Memory 


Constant memory is read-only to a program running on the SM (it can be written 
via commands to the GPU). It is stored in external DRAM and cached in the SM. 
Because commonly most or all threads in a SIMT warp read from the same address 
in constant memory, a single address lookup per clock is sufficient. The constant 
cache is designed to broadcast scalar values to threads in each warp. 


Texture Memory 


Texture memory holds large read-only arrays of data. Textures for computing have 
the same attributes and capabilities as textures used with 3D graphics. Although 
textures are commonly two-dimensional images (2D arrays of pixel values), 1D 
(linear) and 3D (volume) textures are also available. 

A compute program references a texture using a tex instruction. Operands 
include an identifier to name the texture, and 1, 2, or 3 coordinates based on the 
texture dimensionality. The floating-point coordinates include a fractional portion 
that specifies a sample location often in between texel locations. Noninteger 
coordinates invoke a bilinear weighted interpolation of the four closest values (for 
a 2D texture) before the result is returned to the program. 

Texture fetches are cached in a streaming cache hierarchy designed to optimize 
throughput of texture fetches from thousands of concurrent threads. Some programs 
use texture fetches as a way to cache global memory. 


A.6 Floating-point 


A-41 


Surfaces 


Surface is a generic term for a one-dimensional, two-dimensional, or three- 
dimensional array of pixel values and an associated format. A variety of formats 
are defined; for example, a pixel may be defined as four 8-bit RGBA integer 
components, or four 16-bit floating-point components. A program kernel does 
not need to know the surface type. A tex instruction recasts its result values as 
floating-point, depending on the surface format. 


Load/Store Access 


Load/store instructions with integer byte addressing enable the writing and com- 
piling of programs in conventional languages like C and C++. CUDA programs use 
load/store instructions to access memory. 

To improve memory bandwidth and reduce overhead, the local and global load/ 
store instructions coalesce individual parallel thread requests from the same warp 
together into a single memory block request when the addresses fall in the same 
block and meet alignment criteria. Coalescing individual small memory requests 
into large block requests provides a significant performance boost over separate 
requests. The large thread count, together with support for many outstanding load 
requests, helps cover load-to-use latency for local and global memory implemented 
in external DRAM. 


ROP 


As shown in Figure A.2.5, NVIDIA Tesla architecture GPUs comprise a scalable 
streaming processor array (SPA), which performs all of the GPU’s programmable 
calculations, and a scalable memory system, which comprises external DRAM 
control and fixed function Raster Operation Processors (ROPs) that perform color 
and depth framebuffer operations directly on memory. Each ROP unit is paired 
with a specific memory partition. ROP partitions are fed from the SMs via an 
interconnection network. Each ROP is responsible for depth and stencil tests and 
updates, as well as color blending. The ROP and memory controllers cooperate 
to implement lossless color and depth compression (up to 8:1) to reduce external 
bandwidth demand. ROP units also perform atomic operations on memory. 


OA Floating-point Arithmetic 


GPUs today perform most arithmetic operations in the programmable processor 
cores using IEEE 754—compatible single precision 32-bit floating-point operations 
(see Chapter 3). The fixed-point arithmetic of early GPUs was succeeded by 16-bit, 
24-bit, and 32-bit floating-point, then IEEE 754—compatible 32-bit floating-point. 
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half precision A 16-bit 
binary floating-point 
format, with 1 sign bit, 
5-bit exponent, 10-bit 
fraction, and an implied 
integer bit. 


multiply-add (MAD) 

A single floating-point 
instruction that performs 
a compound operation: 
multiplication followed 
by addition. 


Some fixed-function logic within a GPU, such as texture-filtering hardware, 
continues to use proprietary numeric formats. Recent GPUs also provide IEEE 754 
compatible double precision 64-bit floating-point instructions. 


Supported Formats 


The IEEE 754 standard for floating-point arithmetic [2008] specifies basic and 
storage formats. GPUs use two of the basic formats for computation, 32-bit and 
64-bit binary floating-point, commonly called single precision and double pre- 
cision. The standard also specifies a 16-bit binary storage floating-point format, 
half precision. GPUs and the Cg shading language employ the narrow 16-bit 
half data format for efficient data storage and movement, while maintaining high 
dynamic range. GPUs perform many texture filtering and pixel blending computa- 
tions at half precision within the texture filtering unit and the raster operations 
unit. The OpenEXR high dynamic-range image file format developed by Industrial 
Light and Magic [2003] uses the identical half format for color component values 
in computer imaging and motion picture applications. 


Basic Arithmetic 


Common single precision floating-point operations in GPU programmable cores 
include addition, multiplication, multiply-add, minimum, maximum, compare, 
set predicate, and conversions between integer and floating-point numbers. 
Floating-point instructions often provide source operand modifiers for negation 
and absolute value. 

The floating-point addition and multiplication operations of most GPUs today 
are compatible with the IEEE 754 standard for single precision FP numbers, includ- 
ing not-a-number (NaN) and infinity values. The FP addition and multiplica- 
tion operations use IEEE round-to-nearest-even as the default rounding mode. 
To increase floating-point instruction throughput, GPUs often use a compound 
multiply-add instruction (mad). The multiply-add operation performs FP multipli- 
cation with truncation, followed by FP addition with round-to-nearest-even. It 
provides two floating-point operations in one issuing cycle, without requiring the 
instruction scheduler to dispatch two separate instructions, but the computation 
is not fused and truncates the product before the addition. This makes it different 
from the fused multiply-add instruction discussed in Chapter 3 and later in this 
section. GPUs typically flush denormalized source operands to sign-preserved zero, 
and they flush results that underflow the target output exponent range to sign- 
preserved zero after rounding. 


Specialized Arithmetic 


GPUs provide hardware to accelerate special function computation, attribute 
interpolation, and texture filtering. Special function instructions include cosine, 
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sine, binary exponential, binary logarithm, reciprocal, and reciprocal square root. 
Attribute interpolation instructions provide efficient generation of pixel attributes, 
derived from plane equation evaluation. The special function unit (SFU) 
introduced in Section A.4 computes special functions and interpolates planar 
attributes [Oberman and Siu, 2005]. 

Several methods exist for evaluating special functions in hardware. It has been 
shown that quadratic interpolation based on Enhanced Minimax Approximations 
is a very efficient method for approximating functions in hardware, including 
reciprocal, reciprocal square-root, log,x, 2*, sin, and cos. 

We can summarize the method of SFU quadratic interpolation. For a binary 
input operand X with n-bit significand, the significand is divided into two parts: 
X, is the upper part containing m bits, and X is the lower part containing n-m bits. 
The upper m bits X, are used to consult a set of three lookup tables to return three 
finite-word coefficients Cy, C}, and C,. Each function to be approximated requires 
a unique set of tables. These coefficients are used to approximate a given function 
f(X) in the range X, <= X < X, + 2"™ by evaluating the expression: 


f(X) =C +C; Xi +C, X? 


The accuracy of each of the function estimates ranges from 22 to 24 significand 
bits. Example function statistics are shown in Figure A.6.1. 

The IEEE 754 standard specifies exact-rounding requirements for division 
and square root, however, for many GPU applications, exact compliance is not 
required. Rather, for those applications, higher computational throughput is more 
important than last-bit accuracy. For the SFU special functions, the CUDA math 
library provides both a full accuracy function and a fast function with the SFU 
instruction accuracy. 

Another specialized arithmetic operation in a GPU is attribute interpolation. 
Key attributes are usually specified for vertices of primitives that make up a scene 
to be rendered. Example attributes are color, depth, and texture coordinates. These 
attributes must be interpolated in the (x,y) screen space as needed to determine the 


Input Accuracy % exactly 
interval (good bits) rounded 
1/x [1, 2) 87 Yes 


24.02 0.98 
1/sqrt(x) A 23.40 1:52 78 Yes 
2* [O, 1) 22.51 1.41 74 Yes 
logax (1, 2) 22.57 N/A** N/A Yes 
sin/cos [0, 1/2) 22.47 N/A N/A No 


*ULP: unit in the last place. *“N/A: not applicable. 


FIGURE A.6.1 Special function approximation statistics. For the NVIDIA GeForce 8800 special 
function unit (SFU). 


special function unit 
(SFU) A hardware unit 
that computes special 
functions and interpolates 
planar attributes. 
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MIP-map A Latin phrase 
multum in parvo, or 
much in a small space. 

A MIP-map contains 
precalculated images of 
different resolutions, used 
to increase rendering 


speed and reduce artifacts. 


values of the attributes at each pixel location. The value of a given attribute Uin an 
(x,y) plane can be expressed using plane equations of the form: 


U(x y) =A,x+ Bay + Ca 


where A, B, and C are interpolation parameters associated with each attribute U. 
The interpolation parameters A, B, and C are all represented as single precision 
floating-point numbers. 

Given the need for both a function evaluator and an attribute interpolator in a 
pixel shader processor, a single SFU that performs both functions for efficiency can 
be designed. Both functions use a sum of products operation to interpolate results, 
and the number of terms to be summed in both functions is very similar. 


Texture Operations 


Texture mapping and filtering is another key set of specialized floating-point 
arithmetic operations in a GPU. The operations used for texture mapping include: 


1. Receive texture address (s, t) for the current screen pixel (x, y), where s and t 
are single precision floating-point numbers. 


2. Compute the level of detail to identify the correct texture MIP-map level. 
3. Compute the trilinear interpolation fraction. 

4. Scale texture address (s, t) for the selected MIP-map level. 

5. Access memory and retrieve desired texels (texture elements). 

6. Perform filtering operation on texels. 


Texture mapping requires a significant amount of floating-point computation 
for full-speed operation, much of which is done at 16-bit half precision. As an 
example, the GeForce 8800 Ultra delivers about 500 GFLOPS of proprietary format 
floating-point computation for texture mapping instructions, in addition to its 
conventional IEEE single precision floating-point instructions. For more details on 
texture mapping and filtering, see Foley and van Dam [1995]. 


Performance 


The floating-point addition and multiplication arithmetic hardware is fully 
pipelined, and latency is optimized to balance delay and area. While pipelined, the 
throughput of the special functions is less than the floating-point addition and 
multiplication operations. Quarter-speed throughput for the special functions 
is typical performance in modern GPUs, with one SFU shared by four SP cores. 
In contrast, CPUs typically have significantly lower throughput for similar 
functions, such as division and square root, albeit with more accurate results. The 
attribute interpolation hardware is typically fully pipelined to enable full-speed 
pixel shaders. 
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Double precision 


Newer GPUs such as the Tesla T10P also support IEEE 754 64-bit double precision 
operations in hardware. Standard floating-point arithmetic operations in double 
precision include addition, multiplication, and conversions between different 
floating-point and integer formats. The 2008 IEEE 754 floating-point standard 
includes specification for the fused-multiply-add operation (FMA), as discussed in 
Chapter 3. The FMA operation performs a floating-point multiplication followed 
by an addition, with a single rounding. The fused multiplication and addition 
operations retain full accuracy in intermediate calculations. This behavior enables 
more accurate floating-point computations involving the accumulation of prod- 
ucts, including dot products, matrix multiplication, and polynomial evaluation. 
The FMA instruction also enables efficient software implementations of exactly 
rounded division and square root, removing the need for a hardware division or 
square root unit. 

A double precision hardware FMA unit implements 64-bit addition, multipli- 
cation, conversions, and the FMA operation itself. The architecture of a double 


a on b l | 
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| 


FIGURE A.6.2 Double precision fused-multiply-add (FMA) unit. Hardware to implement floating- 


point A x B + C for double precision. 
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precision FMA unit enables full-speed denormalized number support on both 
inputs and outputs. Figure A.6.2 shows a block diagram of an FMA unit. 

As shown in Figure A.6.2, the significands of A and B are multiplied to form 
a 106-bit product, with the results left in carry-save form. In parallel, the 53-bit 
addend C is conditionally inverted and aligned to the 106-bit product. The sum and 
carry results of the 106-bit product are summed with the aligned addend through a 
161-bit-wide carry-saveadder (CSA). Thecarry-saveoutputisthen summed together 
in a carry-propagate adder to produce an unrounded result in nonredundant, two’s 
complement form. The result is conditionally recomplemented, so as to return a 
result in sign-magnitude form. The complemented result is normalized, and then 
it is rounded to fit within the target format. 


Real Stuff: The NVIDIA GeForce 8800 


The NVIDIA GeForce 8800 GPU, introduced in November 2006, is a unified vertex 
and pixel processor design that also supports parallel computing applications 
written in C using the CUDA parallel programming model. It is the first imple- 
mentation of the Tesla unified graphics and computing architecture described in 
Section A.4 and in Lindholm, Nickolls, Oberman, and Montrym [2008]. A family 
of Tesla architecture GPUs addresses the different needs of laptops, desktops, work- 
stations, and servers. 


Streaming Processor Array (SPA) 


The GeForce 8800 GPU shown in Figure A.7.1 contains 128 streaming processor 
(SP) cores organized as 16 streaming multiprocessors (SMs). Two SMs share 
a texture unit in each texture/processor cluster (TPC). An array of eight TPCs 
makes up the streaming processor array (SPA), which executes all graphics shader 
programs and computing programs. 

The host interface unit communicates with the host CPU via the PCI-Express 
bus, checks command consistency, and performs context switching. The input 
assembler collects geometric primitives (points, lines, triangles). The work distri- 
bution blocks dispatch vertices, pixels, and compute thread arrays to the TPCs in 
the SPA. The TPCs execute vertex and geometry shader programs and computing 
programs. Output geometric data is sent to the viewport/clip/setup/raster/zcull 
block to be rasterized into pixel fragments that are then redistributed back into the 
SPA to execute pixel shader programs. Shaded pixels are sent across the intercon- 
nection network for processing by the ROP units. The network also routes tex- 
ture memory read requests from the SPA to DRAM and reads data from DRAM 
through a level-2 cache back to the SPA. 
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FIGURE A.7.1 NVIDIA Tesla unified graphics and computing GPU architecture. This GeForce 8800 has 128 streaming processor 
(SP) cores in 16 streaming multiprocessors (SM), arranged in eight texture/processor clusters (TPC). The processors connect with six 64-bit-wide 
DRAM partitions via an interconnection network. Other GPUs implementing the Tesla architecture vary the number of SP cores, SMs, DRAM 
partitions, and other units. 


Texture/Processor Cluster (TPC) 


Each TPC contains a geometry controller, an SM controller (SMC), two streaming 
multiprocessors (SMs), and a texture unit as shown in Figure A.7. 2. 

The geometry controller maps the logical graphics vertex pipeline into recir- 
culation on the physical SMs by directing all primitive and vertex attribute and 
topology flow in the TPC. 

The SMC controls multiple SMs, arbitrating the shared texture unit, load/store 
path, and I/O path. The SMC serves three graphics workloads simultaneously: 
vertex, geometry, and pixel. 

The texture unit processes a texture instruction for one vertex, geometry, or pixel 
quad, or four compute threads per cycle. Texture instruction sources are texture 
coordinates, and the outputs are weighted samples, typically a four-component 
(RGBA) floating-point color. The texture unit is deeply pipelined. Although it 
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FIGURE A.7.2 Texture/processor cluster (TPC) and a streaming multiprocessor (SM). Each SM has eight streaming processor 
(SP) cores, two SFUs, and a shared memory. 


contains a streaming cache to capture filtering locality, it streams hits mixed with 
misses without stalling. 


Streaming Multiprocessor (SM) 


The SM is a unified graphics and computing multiprocessor that executes vertex, 
geometry, and pixel-fragment shader programs and parallel computing programs. 
The SM consists of eight SP thread processor cores, two SFUs, a multithreaded 
instruction fetch and issue unit (MT issue), an instruction cache, a read-only constant 
cache, and a 16 KB read/write shared memory. It executes scalar instructions for 
individual threads. 

The GeForce 8800 Ultra clocks the SP cores and SFUs at 1.5 GHz, for a peak of 
36 GFLOPS per SM. To optimize power and area efficiency, some SM nondatapath 
units operate at half the SP clock rate. 
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To efficiently execute hundreds of parallel threads while running several 
different programs, the SM is hardware multithreaded. It manages and executes up 
to 768 concurrent threads in hardware with zero scheduling overhead. Each thread 
has its own thread execution state and can execute an independent code path. 

A warp consists of up to 32 threads of the same type—vertex, geometry, pixel, 
or compute. The SIMT design, previously described in Section A.4, shares the SM 
instruction fetch and issue unit efficiently across 32 threads but requires a full warp 
of active threads for full performance efficiency. 

The SM schedules and executes multiple warp types concurrently. Each issue 
cycle, the scheduler selects one of the 24 warps to execute a SIMT warp instruction. 
An issued warp instruction executes as four sets of 8 threads over four processor 
cycles. The SP and SFU units execute instructions independently, and by issuing 
instructions between them on alternate cycles, the scheduler can keep both fully 
occupied. A scoreboard qualifies each warp for issue each cycle. The instruction 
scheduler prioritizes all ready warps and selects the one with highest priority for 
issue. Prioritization considers warp type, instruction type, and “fairness” to all 
warps executing in the SM. 

The SM executes cooperative thread arrays (CTAs) as multiple concurrent 
warps which access a shared memory region allocated dynamically for the CTA. 


Instruction Set 


Threads execute scalar instructions, unlike previous GPU vector instruction 
architectures. Scalar instructions are simpler and compiler friendly. Texture 
instructions remain vector based, taking a source coordinate vector and returning 
a filtered color vector. 

The register-based instruction set includes all the floating-point and integer 
arithmetic, transcendental, logical, flow control, memory load/store, and texture 
instructions listed in the PTX instruction table of Figure A.4.3. Memory load/store 
instructions use integer byte addressing with register-plus-offset address arithmetic. 
For computing, the load/store instructions access three read-write memory spaces: 
local memory for per-thread, private, temporary data; shared memory for low- 
latency per-CTA data shared by the threads of the CTA; and global memory for 
data shared by all threads. Computing programs use the fast barrier synchroniza- 
tion bar. sync instruction to synchronize threads within a CTA that communicate 
with each other via shared and global memory. The latest Tesla architecture GPUs 
implement PTX atomic memory operations, which facilitate parallel reductions 
and parallel data structure management. 


Streaming Processor (SP) 


The multithreaded SP core is the primary thread processor, as introduced in 
Section A.4. Its register file provides 1024 scalar 32-bit registers for up to 96 
threads (more threads than the example SP of Section A.4). Its floating-point 
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add and multiply operations are compatible with the IEEE 754 standard for 
single precision FP numbers, including not-a-number (NaN) and infinity. The 
add and multiply operations use IEEE round-to-nearest-even as the default 
rounding mode. The SP core also implements all of the 32-bit and 64-bit integer 
arithmetic, comparison, conversion, and logical PTX instructions in Figure A.4.3. 
The processor is fully pipelined, and latency is optimized to balance delay 
and area. 


Special Function Unit (SFU) 


The SFU supports computation of both transcendental functions and planar 
attribute interpolation. As described in Section A.6, it uses quadratic interpola- 
tion based on enhanced minimax approximations to approximate the reciprocal, 
reciprocal square root, log,x, 2%, and sin/cos functions at one result per cycle. The 
SFU also supports pixel attribute interpolation such as color, depth, and texture 
coordinates at four samples per cycle. 


Rasterization 


Geometry primitives from the SMs go in their original round-robin input order 
to the viewport/clip/setup/raster/zcull block. The viewport and clip units clip 
the primitives to the view frustum and to any enabled user clip planes, and then 
transform the vertices into screen (pixel) space. 

Surviving primitives then go to the setup unit, which generates edge equations 
for the rasterizer. A coarse-rasterization stage generates all pixel tiles that are at 
least partially inside the primitive. The zcull unit maintains a hierarchical z surface, 
rejecting pixel tiles if they are conservatively known to be occluded by previously 
drawn pixels. The rejection rate is up to 256 pixels per clock. Pixels that survive 
zcull then go to a fine-rasterization stage that generates detailed coverage informa- 
tion and depth values. 

The depth test and update can be performed ahead of the fragment shader, or 
after, depending on current state. The SMC assembles surviving pixels into warps 
to be processed by an SM running the current pixel shader. The SMC then sends 
surviving pixel and associated data to the ROP. 


Raster Operations Processor (ROP) and Memory System 


Each ROP is paired with a specific memory partition. For each pixel fragment 
emitted by a pixel shader program, ROPs perform depth and stencil testing and 
updates, and in parallel, color blending and updates. Lossless color compression (up 
to 8:1) and depth compression (up to 8:1) are used to reduce DRAM bandwidth. 
Each ROP has a peak rate of four pixels per clock and supports 16-bit floating- 
point and 32-bit floating-point HDR formats. ROPs support double-rate-depth 
processing when color writes are disabled. 
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Antialiasing support includes up to 16x multisampling and supersampling. The 
coverage-sampling antialiasing (CSAA) algorithm computes and stores Boolean 
coverage at up to 16 samples and compresses redundant color, depth, and stencil 
information into the memory footprint and a bandwidth of four or eight samples 
for improved performance. 

The DRAM memory data bus width is 384 pins, arranged in six independent 
partitions of 64 pins each. Each partition supports double-data-rate DDR2 and 
graphics-oriented GDDR3 protocols at up to 1.0 GHz, yielding a bandwidth of 
about 16 GB/s per partition, or 96 GB/s. 

The memory controllers support a wide range of DRAM clock rates, protocols, 
device densities, and data bus widths. Texture and load/store requests can occur 
between any TPC and any memory partition, so an interconnection network routes 
requests and responses. 


Scalability 


The Tesla unified architecture is designed for scalability. Varying the number of 
SMs, TPCs, ROPs, caches, and memory partitions provides the right balance for 
different performance and cost targets in GPU market segments. Scalable link 
interconnect (SLI) connects multiple GPUs, providing further scalability. 


Performance 


The GeForce 8800 Ultra clocks the SP thread processor cores and SFUs at 1.5 GHz, 
for a theoretical operation peak of 576 GFLOPS. The GeForce 8800 GTX has a 
1.35 GHz processor clock and a corresponding peak of 518 GFLOPS. 

The following three sections compare the performance of a GeForce 8800 GPU 
with a multicore CPU on three different applications—dense linear algebra, fast 
Fourier transforms, and sorting. The GPU programs and libraries are compiled 
CUDA C code. The CPU code uses the single precision multithreaded Intel MKL 
10.0 library to leverage SSE instructions and multiple cores. 


Dense Linear Algebra Performance 


Dense linear algebra computations are fundamental in many applications. Volkov 
and Demmel [2008] present GPU and CPU performance results for single precision 
dense matrix-matrix multiplication (the SGEMM routine) and LU, QR, and 
Cholesky matrix factorizations. Figure A.7.3 compares GFLOPS rates on SGEMM 
dense matrix-matrix multiplication for a GeForce 8800 GTX GPU with a quad-core 
CPU. Figure A.7.4 compares GFLOPS rates on matrix factorization for a GPU with 
a quad-core CPU. 

Because SGEMM matrix-matrix multiply and similar BLAS3 routines are the 
bulk of the work in matrix factorization, their performance sets an upper bound on 
factorization rate. As the matrix order increases beyond 200 to 400, the factorization 
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FIGURE A.7.3  SGEMM dense matrix-matrix multiplication performance rates. The graph 
shows single precision GFLOPS rates achieved in multiplying square NXN matrices (solid lines) and thin 
Nx64 and 64xN matrices (dashed lines). Adapted from Figure 6 of Volkov and Demmel [2008]. The black 
lines are a 1.35 GHz GeForce 8800 GTX using Volkov’s SGEMM code (now in NVIDIA CUBLAS 2.0) on 
matrices in GPU memory. The blue lines are a quad-core 2.4 GHz Intel Core2 Quad Q6600, 64-bit Linux, 
Intel MKL 10.0 on matrices in CPU memory. 
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FIGURE A.7.4 Dense matrix factorization performance rates. The graph shows GFLOPS rates 
achieved in matrix factorizations using the GPU and using the CPU alone. Adapted from Figure 7 of Volkov 
and Demmel [2008]. The black lines are a 1.35 GHz NVIDIA GeForce 8800 GTX, CUDA 1.1, Windows XP 
attached to a 2.67 GHz Intel Core2 Duo E6700 Windows XP, including all CPU-GPU data transfer times. The 
blue lines are a quad-core 2.4 GHz Intel Core2 Quad Q6600, 64-bit Linux, Intel MKL 10.0. 
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problem becomes large enough that SGEMM can leverage the GPU parallelism and 
overcome the CPU-—GPU system and copy overhead. Volkov’s SGEMM matrix- 
matrix multiply achieves 206 GFLOPS, about 60% of the GeForce 8800 GTX 
peak multiply-add rate, while the QR factorization reached 192 GFLOPS, about 
4.3 times the quad-core CPU. 


FFT Performance 


Fast Fourier Transforms are used in many applications. Large transforms and 
multidimensional transforms are partitioned into batches of smaller 1D transforms. 

Figure A.7.5 compares the in-place 1D complex single precision FFT perfor- 
mance of a 1.35 GHz GeForce 8800 GTX (dating from late 2006) with a 2.8 GHz 
quad-Core Intel Xeon E5462 series (code named “Harpertown,” dating from late 
2007). CPU performance was measured using the Intel Math Kernel Library (MKL) 
10.0 FFT with four threads. GPU performance was measured using the NVIDIA 
CUFFT 2.1 library and batched 1D radix-16 decimation-in-frequency FFTs. Both 
CPU and GPU throughput performance was measured using batched FFTs, batch 
size was 274/n, where n is the transform size. Thus, the workload for every trans- 
form size was 128 MB. To determine GFLOPS rate, the number of operations per 
transform was taken as 5n log, n. 
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FIGURE A.7.5 Fast Fourier Transform throughput performance. The graph compares the 
performance of batched one-dimensional in-place complex FFTs on a 1.35 GHz GeForce 8800 GTX with 
a quad-core 2.8 GHz Intel Xeon E5462 series (code named “Harpertown”), 6MB L2 Cache, 4GB Memory, 
1600 FSB, Red Hat Linux, Intel MKL 10.0. 
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Sorting Performance 


In contrast to the applications just discussed, sort requires far more substantial 
coordination among parallel threads, and parallel scaling is correspondingly 
harder to obtain. Nevertheless, a variety of well-known sorting algorithms can 
be efficiently parallelized to run well on the GPU. Satish, et al. [2008] detail the 
design of sorting algorithms in CUDA, and the results they report for radix sort are 
summarized below. 

Figure A.7.6 compares the parallel sorting performance of a GeForce 8800 Ultra 
with an 8-core Intel Clovertown system, both of which date to early 2007. The 
CPU cores are distributed between two physical sockets. Each socket contains a 
multichip module with twin Core2 chips, and each chip has a 4MB L2 cache. All 
sorting routines were designed to sort key-value pairs where both keys and values 
are 32-bit integers. The primary algorithm being studied is radix sort, although 
the quicksort-based paral lel_sort() procedure provided by Intel’s Threading 
Building Blocks is also included for comparison. Of the two CPU-based radix sort 
codes, one was implemented using only the scalar instruction set and the other 
utilizes carefully hand-tuned assembly language routines that take advantage of the 
SSE2 SIMD vector instructions. 

The graph itself shows the achieved sorting rate—defined as the number of ele- 
ments sorted divided by the time to sort—for a range of sequence sizes. It is apparent 
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FIGURE A.7.6 Parallel sorting performance. This graph compares sorting rates for parallel radix sort implementations on a 1.5 GHz 
GeForce 8800 Ultra and an 8-core 2.33 GHz Intel Core2 Xeon E5345 system. 
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from this graph that the GPU radix sort achieved the highest sorting rate for all 
sequences of 8K-elements and larger. In this range, it is on average 2.6 times faster 
than the quicksort-based routine and roughly 2 times faster than the radix sort rou- 
tines, all of which were using the eight available CPU cores. The CPU radix sort per- 
formance varies widely, likely due to poor cache locality of its global permutations. 


Real Stuff: Mapping Applications to GPUs 


The advent of multicore CPUs and manycore GPUs means that mainstream 
processor chips are now parallel systems. Furthermore, their parallelism continues 
to scale with Moore’s law. The challenge is to develop mainstream visual computing 
and high-performance computing applications that transparently scale their 
parallelism to leverage the increasing number of processor cores, much as 3D 
graphics applications transparently scale their parallelism to GPUs with widely 
varying numbers of cores. 

This section presents examples of mapping scalable parallel computing 
applications to the GPU using CUDA. 


Sparse Matrices 


A wide variety of parallel algorithms can be written in CUDA in a fairly 
straightforward manner, even when the data structures involved are not simple 
regular grids. Sparse matrix-vector multiplication (SpMV) is a good example of 
an important numerical building block that can be parallelized quite directly using 
the abstractions provided by CUDA. The kernels we discuss below, when combined 
with the provided CUBLAS vector routines, make writing iterative solvers such as 
the conjugate gradient method straightforward. 

A sparse n X n matrix is one in which the number of nonzero entries m is only 
a small fraction of the total. Sparse matrix representations seek to store only the 
nonzero elements of a matrix. Since it is fairly typical that a sparse nx n matrix will 
contain only m= O(n) nonzero elements, this represents a substantial savings in 
storage space and processing time. 

One of the most common representations for general unstructured sparse 
matrices is the compressed sparse row (CSR) representation. The m nonzero 
elements of the matrix A are stored in row-major order in an array Av. A second 
array Aj records the corresponding column index for each entry of Av. Finally, an 
array Ap of n+ 1 elements records the extent of each row in the previous arrays; the 
entries for row iin Aj and Av extend from index Api ] up to, but not including, 
index ApLi + 1]. This implies that Ap[ 0] will always be 0 and Ap[n] will always 
be the number of nonzero elements in the matrix. Figure A.8.1 shows an example 
of the CSR representation of a simple matrix. 
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FIGURE A.8.1 Compressed sparse row (CSR) matrix. 


float multiply_row(unsigned int rowsize, 


} 


unsigned int *Aj, // column indices for row 
float *Av, // nonzero entries for row 
float *x) // the RHS vector 

float sum = 0; 


for(unsigned int column=0; column<rowsize; ++column) 
sum += Av[column] * x[LAjlcolumn]]; 


return sum; 


FIGURE A.8.2 Serial C code for a single row of sparse matrix-vector multiply. 


Given a matrix A in CSR form and a vector x, we can compute a single row of 
the product y= Ax using the mul tip] y_row() procedure shown in Figure A.8.2. 
Computing the full product is then simply a matter of looping over all rows and 
computing the result for that row using mul tiply_row(), asin the serial C code 
shown in Figure A.8.3. 

This algorithm can be translated into a parallel CUDA kernel quite easily. We 
simply spread the loop in csrmul_serial() over many parallel threads. Each 
thread will compute exactly one row of the output vector y. The code for this kernel 
is shown in Figure A.8.4. Note that it looks extremely similar to the serial loop 
used in the csrmul_serial() procedure. There are really only two points of 
difference. First, the row index for each thread is computed from the block and 
thread indices assigned to each thread, eliminating the for-loop. Second, we have a 
conditional that only evaluates a row product if the row index is within the bounds 
of the matrix (this is necessary since the number of rows n need not be a multiple 
of the block size used in launching the kernel). 
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void csrmul_serial(unsigned int *Ap, unsigned int *Aj, 
float *Av, unsigned int num_rows, 
float *x, float *y) 


for(unsigned int row=0; row<num_rows; ++row) 

{ 
unsigned int row_begin = ApLrow]; 
unsigned int row_end ApLrow+1]; 


yLrow] = multiply_row(row_end-row_begin, Aj+row_begin, 
Av+row_begin, x); 


} 


FIGURE A.8.3 Serial code for sparse matrix-vector multiply. 


— global__ 

void csrmul_kernel(unsigned int *Ap, unsigned int *Aj, 
float *Av, unsigned int num_rows, 
float *x, float *y) 


unsigned int row = blockIdx.x*blockDim.x + threadIdx.x; 


if( row<num_rows. ) 

{ 
unsigned int row_begin = Ap[row]; 
unsigned int row_end = ApLrow+1]; 


yLrow] = multiply_row(row_end-row_begin, Aj+row_begin, 
Avt+trow_begin, x); 


} 


FIGURE A.8.4 CUDA version of sparse matrix-vector multiply. 


Assuming that the matrix data structures have already been copied to the GPU 
device memory, launching this kernel will look like: 


unsigned int blocksize = 128; // or any size up to 512 
unsigned int nblocks = (num_rows + blocksize - 1) / blocksize; 
csrmul_kernel<<<nblocks,blocksize>>>(Ap, Aj, Av, num_rows, x, y); 
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The pattern that we see here is a very common one. The original serial 
algorithm is a loop whose iterations are independent of each other. Such loops 
can be parallelized quite easily by simply assigning one or more iterations of the 
loop to each parallel thread. The programming model provided by CUDA makes 
expressing this type of parallelism particularly straightforward. 

This general strategy of decomposing computations into blocks of independent 
work, and more specifically breaking up independent loop iterations, is not unique 
to CUDA. This is a common approach used in one form or another by various 
parallel programming systems, including OpenMP and Intel’s Threading Building 
Blocks. 


Caching in Shared memory 


The SpMV algorithms outlined above are fairly simplistic. There are a number of 
optimizations that can be made in both the CPU and GPU codes that can improve 
performance, including loop unrolling, matrix reordering, and register blocking. 
The parallel kernels can also be reimplemented in terms of data parallel scan 
operations presented by Sengupta, et al. [2007]. 

One of the important architectural features exposed by CUDA is the presence 
of the per-block shared memory, a small on-chip memory with very low latency. 
Taking advantage of this memory can deliver substantial performance improve- 
ments. One common way of doing this is to use shared memory as a software- 
managed cache to hold frequently reused data. Modifications using shared memory 
are shown in Figure A.8.5. 

In the context of sparse matrix multiplication, we observe that several rows of 
A may use a particular array element x [ i ]. In many common cases, and particularly 
when the matrix has been reordered, the rows using x[i ] will be rows near row i. 
We can therefore implement a simple caching scheme and expect to achieve some 
performance benefit. The block of threads processing rows i through j will load 
x[i] through x[j ] into its shared memory. We will unroll the mul tip] y_row() 
loop and fetch elements of x from the cache whenever possible. The resulting 
code is shown in Figure A.8.5. Shared memory can also be used to make other 
optimizations, such as fetching Ap[row+1] from an adjacent thread rather than 
refetching it from memory. 

Because the Tesla architecture provides an explicitly managed on-chip shared 
memory, rather than an implicitly active hardware cache, it is fairly common to add 
this sort of optimization. Although this can impose some additional development 
burden on the programmer, it is relatively minor, and the potential performance 
benefits can be substantial. In the example shown above, even this fairly simple 
use of shared memory returns a roughly 20% performance improvement on 
representative matrices derived from 3D surface meshes. The availability of an 
explicitly managed memory in lieu of an implicit cache also has the advantage 
that caching and prefetching policies can be specifically tailored to the application 
needs. 
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— global__ 

void csrmul_cached(unsigned int *Ap, unsigned int *Aj, 
float *Av, unsigned int num_rows, 
const float *x, float *y) 


// Cache the rows of x[] corresponding to this block. 
—_shared__ float cacheLblocksizel; 


unsigned int block_begin = blockIdx.x * blockDim.x; 
unsigned int block_end block_begin + blockDim.x; 
unsigned int row = block_begin + threadIdx.x; 


// Fetch and cache our window of x[]. 
if( row<num_rows) cachelLthreadIdx.x] = x[row]; 
__syncthreads(); 


if( row<num_rows. ) 

{ 
unsigned int row_begin = ApLrow]; 
unsigned int row_end ApLrow+1]; 
float sum = 0, x_j; 


for(unsigned int col=row_begin; col<row_end; ++col) 
{ 
unsigned int j = Aj[col]; 


// Fetch x_j from our cache when possible 
if( j>=block_begin && j<block_end ) 

x_j = cachelLj-block_begin]; 
else 


x_j = xj]; 


sum += Av[col] * xj; 
} 


yLrow] = sum; 


} 


FIGURE A.8.5 Shared memory version of sparse matrix-vector multiply. 
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These are fairly simple kernels whose purpose is to illustrate basic techniques 
in writing CUDA programs, rather than how to achieve maximal performance. 
Numerous possible avenues for optimization are available, several of which are 
explored by Williams, et al. [2007] on a handful of different multicore architectures. 
Nevertheless, it is still instructive to examine the comparative performance of 
even these simplistic kernels. On a 2 GHz Intel Core2 Xeon E5335 processor, 
the csrmul_serial() kernel runs at roughly 202 million nonzeros processed 
per second, for a collection of Laplacian matrices derived from 3D triangulated 
surface meshes. Parallelizing this kernel with the parallel_for construct 
provided by Intel’s Threading Building Blocks produces parallel speed-ups of 2.0, 
2.1, and 2.3 running on two, four, and eight cores of the machine, respectively. 
On a GeForce 8800 Ultra, the csrmul_kernel() and csrmul_cached() 
kernels achieve processing rates of roughly 772 and 920 million nonzeros per 
second, corresponding to parallel speed-ups of 3.8 and 4.6 times over the serial 
performance of a single CPU core. 


Scan and Reduction 


Parallel scan, also known as parallel prefix sum, is one of the most important 
building blocks for data-parallel algorithms [Blelloch, 1990]. Given a sequence a 
of n elements: 


[ao Ay) ++ +> Ap—1] 
and a binary associative operator ®, the scan function computes the sequence: 
scan(a, ®) = [ap (a ® a)),..., (dp B a, ®...® a)l] 


As an example, if we take © to be the usual addition operator, then applying scan 
to the input array 


a=[31704163] 
will produce the sequence of partial sums: 
scan(a, +) = [3 411 11 15 16 22 25] 


This scan operator is an inclusive scan, in the sense that element i of the output 
sequence incorporates element a; of the input. Incorporating only previous 
elements would yield an exclusive scan operator, also known as a prefix-sum 
operation. 

The serial implementation of this operation is extremely simple. It is simply a 
loop that iterates once over the entire sequence, as shown in Figure A.8.6. 

At first glance, it might appear that this operation is inherently serial. However, 
it can actually be implemented in parallel efficiently. The key observation is that 
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template<class T> 
__host__ T plus_scan(T *x, unsigned int n) 
{ 
for(unsigned int i=l; i<n; ++i) 
XIJ = xLi-1] + xlil; 
} 


FIGURE A.8.6 Template for serial plus-scan. 


template<class T> 

__device__ T plus_scan(T *x) 

{ 
unsigned int i = threadIdx.x; 
unsigned int n blockDim.x; 


ll 


for(unsigned int offset=1; offset<n; offset *= 2) 
{ 
T tz 


if(i>=offset) t = xLi-offset]; 
__syncthreads(); 


if(i>d=offset) x[i] = t + x[i]; 
__syncthreads(); 
} 
return x[i]; 
} 


FIGURE A.8.7 CUDA template for parallel plus-scan. 


because addition is associative, we are free to change the order in which elements are 
added together. For instance, we can imagine adding pairs of consecutive elements 
in parallel, and then adding these partial sums, and so on. 

One simple scheme for doing this is from Hillis and Steele [1989]. An 
implementation of their algorithm in CUDA is shown in Figure A.8.7. It assumes 
that the input array x[ ] contains exactly one element per thread of the thread 
block. It performs log, n iterations of a loop collecting partial sums together. 

To understand the action of this loop, consider Figure A.8.8, which illustrates 
the simple case for n= 8 threads and elements. Each level of the diagram represents 
one step of the loop. The lines indicate the location from which the data is being 
fetched. For each element of the output (i.e., the final row of the diagram) we are 
building a summation tree over the input elements. The edges highlighted in blue 
show the form of this summation tree for the final element. The leaves of this tree 
are all the initial elements. Tracing back from any output element shows that it 
incorporates all input values up to and including itself. 
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FIGURE A.8.8 Tree-based parallel scan data references. 


While simple, this algorithm is not as efficient as we would like. Examining 
the serial implementation, we see that it performs O(n) additions. The parallel 
implementation, in contrast, performs O(n log n) additions. For this reason, it 
is not work efficient, since it does more work than the serial implementation to 
compute the same result. Fortunately, there are other techniques for implementing 
scan that are work efficient. Details on more efficient implementation techniques 
and the extension of this per-block procedure to multiblock arrays are provided by 
Sengupta, et al. [2007]. 

In some instances, we may only be interested in computing the sum of all elements 
in an array, rather than the sequence of all prefix sums returned by scan. This is the 
parallel reduction problem. We could simply use a scan algorithm to perform this 
computation, but reduction can generally be implemented more efficiently than scan. 

Figure A.8.9 shows the code for computing a reduction using addition. In 
this example, each thread simply loads one element of the input sequence (i.e., 
it initially sums a subsequence of length 1). At the end of the reduction, we want 
thread 0 to hold the sum of all elements initially loaded by the threads of its block. 
The loop in this kernel implicitly builds a summation tree over the input elements, 
much like the scan algorithm above. 

At the end of this loop, thread 0 holds the sum of all the values loaded by this 
block. If we want the final value of the location pointed to by total to contain the 
total of all elements in the array, we must combine the partial sums of all the blocks 
in the grid. One strategy to do this would be to have each block write its partial 
sum into a second array and then launch the reduction kernel again, repeating 
the process until we had reduced the sequence to a single value. A more attractive 
alternative supported by the Tesla GPU architecture is to use the atomi cAdd() 
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— global__ 
void plus_reduce(int *input, unsigned int N, int *total) 
{ 
unsigned int tid = threadIdx.x; 
unsigned int i = blockIdx.x*blockDim.x + threadIdx.x; 


// Each block loads its elements into shared memory, padding 


// with 0 if N is not a multiple of blocksize 
—_shared__ int x[blocksize]; 

xCtid] = (i<N) ? inputli] : 0; 
__syncthreads(); 


// Every thread now holds 1 input value in x[] 

// 

// Build summation tree over elements. 

for(int s=blockDim.x/2; s>0; s=s/2) 

{ 
if(tid < s) x[tid] += x[tid + s]; 
__syncthreads(); 

} 


// Thread 0 now holds the sum of all input values 


// to this block. Have it add that sum to the running total 


if( tid == 0 ) atomicAdd(total, x[tid]); 
} 


FIGURE A.8.9 CUDA implementation of plus-reduction. 


primitive, an efficient atomic read-modify-write primitive supported by the 
memory subsystem. This eliminates the need for additional temporary arrays and 
repeated kernel launches. 

Parallel reduction is an essential primitive for parallel programming and 
highlights the importance of per-block shared memory and low-cost barriers in 
making cooperation among threads efficient. This degree of data shuffling among 
threads would be prohibitively expensive if done in off-chip global memory. 


Radix Sort 


One important application of scan primitives is in the implementation of sorting 
routines. The code in Figure A.8.10 implements a radix sort of integers across a 
single thread block. It accepts as input an array values containing one 32-bit 
integer for each thread of the block. For efficiency, this array should be stored in 
per-block shared memory, but this is not required for the sort to behave correctly. 

This is a fairly simple implementation of radix sort. It assumes the availability 
of a procedure partition_by_bit() that will partition the given array such that 
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__device__ void radix_sort(unsigned int *values) 
{ 
for(int bit=0; bit<32; ++bit) 
{ 
partition_by_bit(values, bit); 
__syncthreads(); 


} 


FIGURE A.8.10 CUDA code for radix sort. 


__device__ void partition_by_bit(unsigned int *values, 
unsigned int bit) 
{ 


unsigned int 7 = threadIdx.x; 
unsigned int size = blockDim.x; 
unsigned int x_i = valuesli]; 
unsigned int p_i = (x_i >> bit) & 1; 


valuesli] = p_i; 
__syncthr S 


// Compute number of T bits up to and including p_i. 
// Record the total number of F bits as well. 
unsigned int T_before = plus_scan(values); 


unsigned int T_total = values[size-l]; 
unsigned int F_total = size - T_total; 
__syncthreads(); 


// Write every x_i to its proper place 
if( p_i 3 
values[T_before-1 + F_total] = x_i; 
else 
values[i - T_before] = x_i; 
} 


FIGURE A.8.11 CUDA code to partition data on a bit-by-bit basis, as part of radix sort. 


all values with a 0 in the designated bit will come before all values with a 1 in that 
bit. To produce the correct output, this partitioning must be stable. 

Implementing the partitioning procedure is a simple application of scan. Thread 
i holds the value x; and must calculate the correct output index at which to write 
this value. To do so, it needs to calculate (1) the number of threads j < i for which 
the designated bit is 1 and (2) the total number of bits for which the designated bit 
is 0. The CUDA code for partition_by_bit() is shown in Figure A.8.11. 
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A similar strategy can be applied for implementing a radix sort kernel that sorts 
an array of large length, rather than just a one-block array. The fundamental step 
remains the scan procedure, although when the computation is partitioned across 
multiple kernels, we must double-buffer the array of values rather than doing the 
partitioning in place. Details on performing radix sorts on large arrays efficiently 
are provided by Satish, Harris, and Garland [2008]. 


N-Body Applications on a GPU! 


Nyland, Harris, and Prins [2007] describe a simple yet useful computational 
kernel with excellent GPU performance—the all-pairs N-body algorithm. It is a 
time-consuming component of many scientific applications. N-body simulations 
calculate the evolution of a system of bodies in which each body continuously 
interacts with every other body. One example is an astrophysical simulation in 
which each body represents an individual star, and the bodies gravitationally attract 
each other. Other examples are protein folding, where N-body simulation is used 
to calculate electrostatic and van der Waals forces; turbulent fluid flow simulation; 
and global illumination in computer graphics. 

The all-pairs N-body algorithm calculates the total force on each body in the 
system by computing each pair-wise force in the system, summing for each body. 
Many scientists consider this method to be the most accurate, with the only loss of 
precision coming from the floating-point hardware operations. The drawback is its 
O(n) computational complexity, which is far too large for systems with more than 
10° bodies. To overcome this high cost, several simplifications have been proposed 
to yield O(n log n) and O(n) algorithms; examples are the Barnes-Hut algorithm, 
the Fast Multipole Method and Particle-Mesh-Ewald summation. All of the fast 
methods still rely on the all-pairs method as a kernel for accurate computation of 
short-range forces; thus it continues to be important. 


N-Body Mathematics 


For gravitational simulation, calculate the body-body force using elementary 
physics. Between two bodies indexed by i and j, the 3D force vector is: 


mm, rjj 
fj = 2 x 


The force magnitude is calculated in the left term, while the direction is computed 
in the right (unit vector pointing from one body to the other). 

Given a list of interacting bodies (an entire system or a subset), the calculation 
is simple: for all pairs of interactions, compute the force and sum for each body. 
Once the total forces are calculated, they are used to update each body’s position 
and velocity, based on the previous position and velocity. The calculation of the 
forces has complexity O( n?), while the update is O(n). 


1 Adapted from Nyland, Harris and Prins [2007], “Fast N-Body Simulation with CUDA? 
Chapter 31 of GPU Gems 3. 
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The serial force-calculation code uses two nested for-loops iterating over pairs 
of bodies. The outer loop selects the body for which the total force is being calcu- 
lated, and the inner loop iterates over all the bodies. The inner loop calls a function 
that computes the pair-wise force, then adds the force into a running sum. 

To compute the forces in parallel, we assign one thread to each body, since the 
calculation of force on each body is independent of the calculation on all other 
bodies. Once all of the forces are computed, the positions and velocities of the 
bodies can be updated. 

The code for the serial and parallel versions is shown in Figure A.8.12 and 
Figure A.8.13. The serial version has two nested for-loops. The conversion to CUDA, 
like many other examples, converts the serial outer loop to a per-thread kernel 
where each thread computes the total force on a single body. The CUDA kernel 
computes a global thread ID for each thread, replacing the iterator variable of the 
serial outer loop. Both kernels finish by storing the total acceleration in a global 
array used to compute the new position and velocity values in a subsequent step. 


void accel_on_all_bodies() 
{ 
ine ‘sy. js 
float3 acc(0.0f, 0.0f, 0.0f); 
for (i = 0; i < N; i++) { 
for (j = 0; j < N; j++) { 
acc = body_body_interaction(acc, body[i], body[j]); 


} 
accellLi] = acc; 


} 


FIGURE A.8.12 Serial code to compute all pair-wise forces on N bodies. 


__global__ void accel_on_one_body( ) 


{ 
int ił = threadIdx.x + blockDim.x * blockIdx.x; 


int j; 
float3 acc(0.0f, 0.0f, 0.0f); 


for (J = 0; J < N; g++) { 
acc = body_body_interaction(acc, bodyLi], bodyLj]); 
} 
accelLi] = acc; 
} 


FIGURE A.8.13 CUDA thread code to compute the total force on a single body. 
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The outer loop is replaced by a CUDA kernel grid that launches N threads, one for 
each body. 


Optimization for GPU Execution 


The CUDA code shown is functionally correct, but is not efficient, as it ignores 
key architectural features. Better performance can be achieved with three main 
optimizations. First, shared memory can be used to avoid identical memory reads 
between threads. Second, using multiple threads per body improves performance 
for small values of N. Third, loop unrolling reduces loop overhead. 


Using Shared memory 


Shared memory can hold a subset of body positions, much like a cache, eliminating 
redundant global memory requests between threads. We optimize the code shown 
above to have each of p threads in a thread-block load one position into shared 
memory (for a total of p positions). Once all the threads have loaded a value into 
shared memory, ensured by ___syncthreads(), each thread can then perform p 
interactions (using the data in shared memory). This is repeated N/p times to 
complete the force calculation for each body, which reduces the number of requests 
to memory by a factor of p (typically in the range 32-128). 

The function called acce|]_on_one_body( ) requires a few changes to support 
this optimization. The modified code is shown in Figure A.8.14. 


—_ shared__ float4 shPosition[256]; 


__global__ void accel_on_one_body() 
{ 
int i = threadIidx.x + blockDim.x * blockIdx.x; 
int Jẹ Ki 

int p = blockDim.x; 

float3 acc(0.0f, 0.0f, 0.0f); 

Float4 myBody = bodyli]; 


shPosition[threadIdx.x] = bodyLj+threadIdx.x]; 
__syncthreads(); 


for (k = 0; k < p; k++) { // Inner loop accesses p positions 
acc = body_body_interaction(acc, myBody, shPosition[k]); 


} 
__syncthreads(); 
} 
accelli] = acc; 
} 


for (j = 0; j <N; j += p) { // Outer loops jumps by p each time 


FIGURE A.8.14 CUDA code to compute the total force on each body, using shared memory to improve performance. 
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The loop that formerly iterated over all bodies now jumps by the block 
dimension p. Each iteration of the outer loop loads p successive positions into 
shared memory (one position per thread). The threads synchronize, and then 
p force calculations are computed by each thread. A second synchronization is 
required to ensure that new values are not loaded into shared memory prior to all 
threads completing the force calculations with the current data. 

Using shared memory reduces the memory bandwidth required to less than 
10% of the total bandwidth that the GPU can sustain (using less than 5 GB/s). 
This optimization keeps the application busy performing computation rather than 
waiting on memory accesses, as it would have without the use of shared memory. 
The performance for varying values of N is shown in Figure A.8.15. 


Using Multiple Threads per Body 


Figure A.8.15 shows performance degradation for problems with small values of N 
(N < 4096) on the GeForce 8800 GTX. Many research efforts that rely on N-body 
calculations focus on small N (for long simulation times), making it a target of 
our optimization efforts. Our presumption to explain the lower performance was 
that there was simply not enough work to keep the GPU busy when N is small. 
The solution is to allocate more threads per body. We change the thread-block 
dimensions from (p, 1, 1) to (p, q, 1), where q threads divide the work of a single body 
into equal parts. By allocating the additional threads within the same thread block, 
partial results can be stored in shared memory. When all the force calculations are 
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FIGURE A.8.15 Performance measurements of the N-body application on a GeForce 8800 
GTX and a GeForce 9600. The 8800 has 128 stream processors at 1.35 GHz, while the 9600 has 64 at 
0.80 GHz (about 30% of the 8800). The peak performance is 242 GFLOPS. For a GPU with more processors, 
the problem needs to be bigger to achieve full performance (the 9600 peak is around 2048 bodies, while the 
8800 doesn’t reach its peak until 16,384 bodies). For small N, more than one thread per body can significantly 
improve performance, but eventually incurs a performance penalty as N grows. 
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done, the q partial results can be collected and summed to compute the final result. 
Using two or four threads per body leads to large improvements for small N. 
Asan example, the performance on the 8800 GTX jumps by 110% when N= 1024 
(one thread achieves 90 GFLOPS, where four achieve 190 GFLOPS). Performance 
degrades slightly on large N, so we only use this optimization for N smaller than 
4096. The performance increases are shown in Figure A.8.15 for a GPU with 128 
processors and a smaller GPU with 64 processors clocked at two-thirds the speed. 


Performance Comparison 


The performance of the N-body code is shown in Figure A.8.15 and Figure A.8.16. 
In Figure A.8.15, performance of high- and medium-performance GPUs is shown, 
along with the performance improvements achieved by using multiple threads 
per body. The performance on the faster GPU ranges from 90 to just under 250 
GFLOPS. 

Figure A.8.16 shows nearly identical code (C++ versus CUDA) running on Intel 
Core2 CPUs. The CPU performance is about 1% of the GPU, in the range of 0.2 to 
2 GFLOPS, remaining nearly constant over the wide range of problem sizes. 
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FIGURE A.8.16 Performance measurements on the N-body code on a CPU. The graph shows 
single precision N-body performance using Intel Core2 CPUs, denoted by their CPU model number. Note 
the dramatic reduction in GFLOPS performance (shown in GFLOPS on the y-axis), demonstrating how 
much faster the GPU is compared to the CPU. The performance on the CPU is generally independent of 
problem size, except for an anomalously low performance when N=16,384 on the X9775 CPU. The graph 
also shows the results of running the CUDA version of the code (using the CUDA-for-CPU compiler) on a 
single CPU core, where it outperforms the C++ code by 24%. As a programming language, CUDA exposes 
parallelism and locality that a compiler can exploit. The Intel CPUs are a 3.2 GHz Extreme X9775 (code 
named “Penryn”), a 2.66 GHz E8200 (code named “Wolfdale”), a desktop, pre-Penryn CPU, and a 1.83 GHz 
T2400 (code named “Yonah”), a 2007 laptop CPU. The Penryn version of the Core 2 architecture is particu- 
larly interesting for N-body calculations with its 4-bit divider, allowing division and square root operations 
to execute four times faster than previous Intel CPUs. 
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The graph also shows the results of compiling the CUDA version of the code 
for a CPU, where the performance improves by 24%. CUDA, as a programming 
language, exposes parallelism, allowing the compiler to make better use of the SSE 
vector unit on a single core. The CUDA version of the N-body code naturally maps 
to multicore CPUs as well (with grids of blocks), where it achieves nearly perfect 
scaling on an eight-core system with N = 4096 (ratios of 2.0, 3.97, and 7.94 on two, 
four, and eight cores, respectively). 


Results 


With a modest effort, we developed a computational kernel that improves GPU 
performance over multicore CPUs by a factor of up to 157. Execution time for 
the N-body code running on a recent CPU from Intel (Penryn X9775 at 3.2 GHz, 
single core) took more than 3 seconds per frame to run the same code that runs at a 
44 Hz frame rate on a GeForce 8800 GPU. On pre-Penryn CPUs, the code requires 
6-16 seconds, and on older Core2 processors and Pentium IV processor, the time 
is about 25 seconds. We must divide the apparent increase in performance in half, 
as the CPU requires only half as many calculations to compute the same result 
(using the optimization that the forces on a pair of bodies are equal in strength and 
opposite in direction). 

How can the GPU speed up the code by such a large amount? The answer 
requires inspecting architectural details. The pair-wise force calculation requires 
20 floating-point operations, comprised mostly of addition and multiplication 
instructions (some of which can be combined using a multiply-add instruction), 
but there are also division and square root instructions for vector normaliza- 
tion. Intel CPUs take many cycles for single precision division and square root 
instructions,” although this has improved in the latest Penryn CPU family with 
its faster 4-bit divider.? Additionally, the limitations in register capacity leads to 
many MOV instructions in the x86 code (presumably to/from L1 cache). In con- 
trast, the GeForce 8800 executes a reciprocal square-root thread instruction in four 
clocks; see Section A.6 for special function accuracy. It has a larger register file (per 
thread) and shared memory that can be accessed as an instruction operand. Finally, 
the CUDA compiler emits 15 instructions for one iteration of the loop, compared 
with more than 40 instructions from a variety of x86 CPU compilers. Greater 
parallelism, faster execution of complex instructions, more register space, and an 
efficient compiler all combine to explain the dramatic performance improvement 
of the N-body code between the CPU and the GPU. 


The x86 SSE instructions reciprocal-square-root (RSQRT™) and reciprocal (RCP*) were not 
considered, as their accuracy is too low to be comparable. 

3 Intel Corporation, Intel 64 and IA-32 Architectures Optimization Reference Manual. November 
2007. Order Number: 248966-016. Also available at www3.intel.com/design/processor/manuals/ 
248966.pdf. 
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On a GeForce 8800, the all-pairs N-body algorithm delivers more than 240 
GFLOPS of performance, compared to less than 2 GFLOPS on recent sequential 
processors. Compiling and executing the CUDA version of the code on a CPU 
demonstrates that the problem scales well to multicore CPUs, but is still significantly 
slower than a single GPU. 

We coupled the GPU N-body simulation with a graphical display of the motion, 
and can interactively display 16K bodies interacting at 44 frames per second. 
This allows astrophysical and biophysical events to be displayed and navigated at 
interactive rates. Additionally, we can parameterize many settings, such as noise 
reduction, damping, and integration techniques, immediately displaying their 
effects on the dynamics of the system. This provides scientists with stunning 
visual imagery, boosting their insights on otherwise invisible systems (too large 
or small, too fast or too slow), allowing them to create better models of physical 
phenomena. 

Figure A.8.17 shows a time-series display of an astrophysical simulation 
of 16K bodies, with each body acting as a galaxy. The initial configuration is a 


FIGURE A.8.17 12 images captured during the evolution of an N-body system with 16,384 bodies. 
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spherical shell of bodies rotating about the z-axis. One phenomenon of interest 
to astrophysicists is the clustering that occurs, along with the merging of galaxies 
over time. For the interested reader, the CUDA code for this application is available 
in the CUDA SDK from www.nvidia.com/CUDA. 


Fallacies and Pitfalls 


GPUs have evolved and changed so rapidly that many fallacies and pitfalls have 
arisen. We cover a few here. 


Fallacy: GPUs are just SIMD vector multiprocessors. It is easy to draw the false 
conclusion that GPUs are simply SIMD vector multiprocessors. GPUs do have a 
SPMD-style programming model, in that a programmer can write a single pro- 
gram that is executed in multiple thread instances with multiple data. The execu- 
tion of these threads is not purely SIMD or vector, however; it is single-instruction 
multiple-thread (SIMT), described in Section A.4. Each GPU thread has its own 
scalar registers, thread private memory, thread execution state, thread ID, indepen- 
dent execution and branch path, and effective program counter, and can address 
memory independently. Although a group of threads (e.g., a warp of 32 threads) 
executes more efficiently when the PCs for the threads are the same, this is not 
necessary. So, the multiprocessors are not purely SIMD. The thread execution 
model is MIMD with barrier synchronization and SIMT optimizations. Execution 
is more efficient if individual thread load/store memory accesses can be coalesced 
into block accesses, as well. However, this is not strictly necessary. In a purely SIMD 
vector architecture, memory/register accesses for different threads must be aligned 
in a regular vector pattern. A GPU has no such restriction for register or mem- 
ory accesses; however, execution is more efficient if warps of threads access local 
blocks of data. 

In a further departure from a pure SIMD model, an SIMT GPU can execute 
more than one warp of threads concurrently. In graphics applications, there may 
be multiple groups of vertex programs, pixel programs, and geometry programs 
running in the multiprocessor array concurrently. Computing programs may also 
execute different programs concurrently in different warps. 


Fallacy: GPU performance cannot grow faster than Moore’s law. Moore’s law 
is simply a rate. It is not a “speed of light” limit for any other rate. Moore’s law 
describes an expectation that over time, as semiconductor technology advances 
and transistors become smaller, the manufacturing cost per transistor will decline 
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exponentially. Put another way, given a constant manufacturing cost, the number 
of transistors will increase exponentially. Gordon Moore [1965] predicted that 
this progression would provide roughly two times the number of transistors for 
the same manufacturing cost every year, and later revised it to doubling every 
two years. Although Moore made the initial prediction in 1965 when there were 
just 50 components per integrated circuit, it has proved remarkably consistent. The 
reduction of transistor size has historically had other benefits, such as lower power 
per transistor and faster clock speeds at constant power. 

This increasing bounty of transistors is used by chip architects to build proces- 
sors, memory, and other components. For some time, CPU designers have used 
the extra transistors to increase processor performance at a rate similar to Moore’s 
law, so much so that many people think that processor performance growth of two 
times every 18—24 months is Moore’s law. In fact, it is not. 

Microprocessor designers spend some of the new transistors on processor cores, 
improving the architecture and design, and pipelining for more clock speed. The 
rest of the new transistors are used for providing more cache, to make memory 
access faster. In contrast, GPU designers use almost none of the new transistors to 
provide more cache; most of the transistors are used for improving the processor 
cores and adding more processor cores. 

GPUs get faster by four mechanisms. First, GPU designers reap the Moore’s law 
bounty directly by applying exponentially more transistors to building more parallel, 
and thus faster, processors. Second, GPU designers can improve on the architecture 
over time, increasing the efficiency of the processing. Third, Moore’s law assumes 
constant cost, so the Moore’s law rate can clearly be exceeded by spending more for 
larger chips with more transistors. Fourth, GPU memory systems have increased 
their effective bandwidth at a pace nearly comparable to the processing rate, by 
using faster memories, wider memories, data compression, and better caches. The 
combination of these four approaches has historically allowed GPU performance 
to double regularly, roughly every 12 to 18 months. This rate, exceeding the rate 
of Moore’s law, has been demonstrated on graphics applications for approximately 
ten years and shows no sign of significant slowdown. The most challenging rate 
limiter appears to be the memory system, but competitive innovation is advancing 
that rapidly too. 


Fallacy: GPUs only render 3D graphics; they can’t do general computation. 
GPUs are built to render 3D graphics as well as 2D graphics and video. To meet 
the demands of graphics software developers as expressed in the interfaces and 
performance/feature requirements of the graphics APIs, GPUs have become mas- 
sively parallel programmable floating-point processors. In the graphics domain, 
these processors are programmed through the graphics APIs and with arcane 
graphics programming languages (GLSL, Cg, and HLSL, in OpenGL and Direct3D). 
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However, there is nothing preventing GPU architects from exposing the parallel 
processor cores to programmers without the graphics API or the arcane graphics 
languages. 

In fact, the Tesla architecture family of GPUs exposes the processors through 
a software environment known as CUDA, which allows programmers to develop 
general application programs using the C language and soon C++. GPUs are 
Turing-complete processors, so they can run any program that a CPU can run, 
although perhaps less well. And perhaps faster. 


Fallacy: GPUs cannot run double precision floating-point programs fast. In the 
past, GPUs could not run double precision floating-point programs at all, except 
through software emulation. And that’s not very fast at all. GPUs have made the 
progression from indexed arithmetic representation (lookup tables for colors) to 
8-bit integers per color component, to fixed-point arithmetic, to single precision 
floating-point, and recently added double precision. Modern GPUs perform 
virtually all calculations in single precision IEEE floating-point arithmetic, and are 
beginning to use double precision in addition. 

For a small additional cost, a GPU can support double precision floating-point 
as well as single precision floating-point. Today, double precision runs more slowly 
than the single precision speed, about five to ten times slower. For incremental 
additional cost, double precision performance can be increased relative to single 
precision in stages, as more applications demand it. 


Fallacy: GPUs don’t do floating-point correctly. GPUs, at least in the Tesla archi- 
tecture family of processors, perform single precision floating-point processing at 
a level prescribed by the IEEE 754 floating-point standard. So, in terms of accuracy, 
GPUs are the equal of any other IEEE 754—compliant processors. 

Today, GPUs do not implement some of the specific features described in the 
standard, such as handling denormalized numbers and providing precise floating- 
point exceptions. However, the recently introduced Tesla T10P GPU provides full 
IEEE rounding, fused-multiply-add, and denormalized number support for double 
precision. 


Pitfall: Just use more threads to cover longer memory latencies. CPU cores are 
typically designed to run a single thread at full speed. To run at full speed, every 
instruction and its data need to be available when it is time for that instruction to 
run. If the next instruction is not ready or the data required for that instruction is 
not available, the instruction cannot run and the processor stalls. External memory 
is distant from the processor, so it takes many cycles of wasted execution to fetch 
data from memory. Consequently, CPUs require large local caches to keep running 
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without stalling. Memory latency is long, so it is avoided by striving to run in the 
cache. At some point, program working set demands may be larger than any cache. 
Some CPUs have used multithreading to tolerate latency, but the number of threads 
per core has generally been limited to a small number. 

The GPU strategy is different. GPU cores are designed to run many threads 
concurrently, but only one instruction from any thread at a time. Another way to 
say this is that a GPU runs each thread slowly, but in aggregate runs the threads 
efficiently. Each thread can tolerate some amount of memory latency, because 
other threads can run. 

The downside of this is that multipl—many multiple threads—are required 
to cover the memory latency. In addition, if memory accesses are scattered or not 
correlated among threads, the memory system will get progressively slower in 
responding to each individual request. Eventually, even the multiple threads will 
not be able to cover the latency. So, the pitfall is that for the “just use more threads” 
strategy to work for covering latency, you have to have enough threads, and the 
threads have to be well-behaved in terms of locality of memory access. 


Fallacy: O(n) algorithms are difficult to speed up. No matter how fast the GPU is 
at processing data, the steps of transferring data to and from the device may limit 
the performance of algorithms with O(n) complexity (with a small amount of work 
per datum). The highest transfer rate over the PCIe bus is approximately 48 GB/ 
second when DMA transfers are used, and slightly less for nonDMA transfers. The 
CPU, in contrast, has typical access speeds of 8-12 GB/second to system memory. 
Example problems, such as vector addition, will be limited by the transfer of the 
inputs to the GPU and the returning output from the computation. 

There are three ways to overcome the cost of transferring data. First, try to leave 
the data on the GPU for as long as possible, instead of moving the data back and 
forth for different steps of a complicated algorithm. CUDA deliberately leaves data 
alone in the GPU between launches to support this. 

Second, the GPU supports the concurrent operations of copy-in, copy-out and 
computation, so data can be streamed in and out of the device while it is computing. 
This model is useful for any data stream that can be processed as it arrives. Examples 
are video processing, network routing, data compression/decompression, and even 
simpler computations such as large vector mathematics. 

The third suggestion is to use the CPU and GPU together, improving performance 
by assigning a subset of the work to each, treating the system as a heterogeneous 
computing platform. The CUDA programming model supports allocation of work 
to one or more GPUs along with continued use of the CPU without the use of 
threads (via asynchronous GPU functions), so it is relatively simple to keep all 
GPUs and a CPU working concurrently to solve problems even faster. 
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Concluding Remarks 


GPUs are massively parallel processors and have become widely used, not only 
for 3D graphics, but also for many other applications. This wide application 
was made possible by the evolution of graphics devices into programmable 
processors. The graphics application programming model for GPUs is usually an 
API such as DirectX™ or OpenGL™. For more general-purpose computing, the 
CUDA programming model uses an SPMD (single-program multiple data) style, 
executing a program with many parallel threads. 

GPU parallelism will continue to scale with Moore’s law, mainly by increasing 
the number of processors. Only the parallel programming models that can readily 
scale to hundreds of processor cores and thousands of threads will be successful 
in supporting manycore GPUs and CPUs. Also, only those applications that have 
many largely independent parallel tasks will be accelerated by massively parallel 
manycore architectures. 

Parallel programming models for GPUs are becoming more flexible, for both 
graphics and parallel computing. For example, CUDA is evolving rapidly in the 
direction of full C/C++ functionality. Graphics APIs and programming models will 
likely adapt parallel computing capabilities and models from CUDA. Its SPMD- 
style threading model is scalable, and is a convenient, succinct, and easily learned 
model for expressing large amounts of parallelism. 

Driven by these changes in the programming models, GPU architecture is in 
turn becoming more flexible and more programmable. GPU fixed-function units 
are becoming accessible from general programs, along the lines of how CUDA 
programs already use texture intrinsic functions to perform texture lookups using 
the GPU texture instruction and texture unit. 

GPU architecture will continue to adapt to the usage patterns of both graphics 
and other application programmers. GPUs will continue to expand to include more 
processing power through additional processor cores, as well as increasing the thread 
and memory bandwidth available for programs. In addition, the programming 
models must evolve to include programming heterogeneous manycore systems 
including both GPUs and CPUs. 
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This section, which appears in the Historical Perspective and Further Reading section 
of the CD, surveys the history of programmable real-time graphics processing units 
(GPUs) from the early 1980s through today as they declined in price by two orders 
of magnitude and increased in performance by two orders of magnitude. It traces 
the evolution of the GPU from fixed function pipelines to programmable graphics 
processors, with perspectives on GPU computing, unified graphics and computing 
processors, visual computing, and scalable GPUs. 


